Abstract: Inspired by sample splitting and the reusable holdout introduced in the field 
of differential privacy, we consider selective inference with a randomized response. 
We discuss two major advantages of using a randomized response for model selection. 
First, the selectively valid tests are more powerful after randomized selection. Second, 
it allows consistent estimation and weak convergence of selective inference procedures. 
Under independent sampling, we prove a selective (or privatized) central limit theorem 
that transfers procedures valid under asymptotic normality without selection to their 
corresponding selective counterparts. This allows selective inference in nonparametric 
settings. Finally, we propose a framework of inference after combining multiple
randomized selection procedures. We focus on the classical asymptotic setting, leaving the
interesting high-dimensional asymptotic questions for future work.

1. Introduction

Tukey (1980) promoted the use of exploratory data analysis to examine the data and 
possibly formulate hypotheses for further investigation. Nowadays, many statistical 
learning methods allow us to perform these exploratory data analyses, based on which 
we can posit a model on the data generating distribution. Since this model is not given 
a priori, classical statistical inference will not provide valid tests that control the Type-I 
errors. 

Selective inference seeks to address this problem; see Lee et al. (2013a), Lockhart
et al. (2014), Lee & Taylor (2014), Fithian et al. (2014). Loosely speaking, there are
two stages in selective inference. The first is the selection stage, which explores the data
and formulates a plausible model for the data distribution. The second is the inference
stage, which seeks to provide valid inference under the selected model, a model proposed
only after inspecting the data. Inference under different models has been studied, notably
for Gaussian families (Lee et al. (2013a), Tian et al. (2015), Lee & Taylor (2014)) as
well as other exponential families (Fithian et al. (2014)).

In this work, we consider selective inference in a general setting that includes
nonparametric settings. In addition, we introduce the use of a randomized response in
model selection. Perhaps the most common example of randomized model selection is
the practice of data splitting. Assuming independent sampling, we can divide the
data into two subsets, using the first for model selection and the second for
inference. Though not emphasized, this split is often random. Hence, data splitting can
be thought of as a special case of randomized model selection. To motivate the use of
randomized selection and introduce the inference problem that ensues, we consider the 
following example. 


1.1. A first example 


Publication bias (also called the “file drawer effect” by Rosenthal (1979)) is a bias
introduced into the scientific literature by the failure to report negative or non-confirmatory
results. We formulate the problem in the simple example below.

Example 1 (File drawer problem). Let

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_{i,n}$$

be the sample mean of a sample of n iid draws from $F_n$ in a standard triangular array.
We set $\mu_n = E_{F_n}[X_{i,n}]$ and assume $E_{F_n}[(X_{i,n} - \mu_n)^2] = 1$.

Suppose that we are interested in discovering positive effects and would only report
the sample mean if it survives the file drawer effect, i.e.

$$n^{1/2} \bar{X}_n > 2. \tag{1}$$

Then what is the “correct” p-value to report for an observation $\bar{X}_{n,obs}$ that exceeds
the threshold?

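If we have a Gaussian family, namely $F_n = N(\mu_n, 1)$, then the distribution of $\bar{X}_n$
surviving the file drawer effect (1) is a truncated Gaussian distribution. We also call
this distribution the selective distribution. Formally, its survival function is

$$P(t) = P\left(\bar{X}_n > t \mid n^{1/2}\bar{X}_n > 2\right)
      = \frac{1 - \Phi\left(n^{1/2}(t - \mu_n)\right)}{1 - \Phi\left(2 - n^{1/2}\mu_n\right)},
      \qquad \bar{X}_n \sim N(\mu_n, 1/n),$$

for $n^{1/2} t \ge 2$, where $\Phi$ is the CDF of an N(0,1) random variable. Therefore, we get
a pivotal quantity

$$P(\bar{X}_{n,obs}) = \frac{1 - \Phi\left(n^{1/2}(\bar{X}_{n,obs} - \mu_n)\right)}{1 - \Phi\left(2 - n^{1/2}\mu_n\right)} \sim \mathrm{Unif}(0,1),
      \qquad n^{1/2}\bar{X}_{n,obs} > 2, \quad \bar{X}_{n,obs} \sim N(\mu_n, 1/n). \tag{2}$$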
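
As a small numerical illustration of the pivot in (2), the following sketch compares the naive one-sided p-value with the selective one for a Gaussian family. The function names and the example values (n = 100, an observed mean of 0.25, null mean 0) are chosen here for illustration and are not from the original text.

```python
import math

def normal_sf(x):
    """Survival function 1 - Phi(x) of N(0, 1), via erfc for tail accuracy."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def file_drawer_pivot(xbar_obs, n, mu, threshold=2.0):
    """Pivot (2): P(Xbar > xbar_obs | sqrt(n) * Xbar > threshold), Xbar ~ N(mu, 1/n)."""
    root_n = math.sqrt(n)
    if root_n * xbar_obs <= threshold:
        raise ValueError("observation did not survive the file drawer effect")
    numerator = normal_sf(root_n * (xbar_obs - mu))
    denominator = normal_sf(threshold - root_n * mu)
    return numerator / denominator

# Hypothetical observation: n = 100, observed mean 0.25, tested against mu = 0.
naive_pvalue = normal_sf(math.sqrt(100) * 0.25)            # ignores selection
selective_pvalue = file_drawer_pivot(0.25, n=100, mu=0.0)  # conditions on (1)
```

In this example the naive p-value is well below 0.05 while the selective p-value is not, which is the selection bias that the conditioning in (2) corrects for.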


The pivotal quantity in (2) allows us to construct p-values or confidence intervals
for Gaussian families. When the distributions $F_n$ are not normal, the central limit
theorem states that the sample mean $\bar{X}_n$ is asymptotically normal when $F_n$
has second moments. Thus a natural question is whether the pivotal quantity in (2) is
asymptotically Unif(0,1) when $F_n$ does not come from a normal distribution.

Tian and Taylor/Selective inference with a randomized response

The following lemma provides a negative answer to this question in the case when
$F_n$ is a translated Bernoulli distribution that has a negative mean. Essentially, when
the selection event $n^{1/2}\bar{X}_n > 2$ becomes a rare event with vanishing probability, the
pivotal quantity in (2) no longer converges to Unif(0,1). We defer the proof of the
lemma to the appendix.

Lemma 1. Suppose $X_{i,n}$ takes values in $\{-1.5, 0.5\}$, with
$P(X_{i,n} = -1.5) = P(X_{i,n} = 0.5) = 0.5$, so that $\mu_n = -0.5$. Then the pivot in (2)
does not converge to Unif(0,1),

$$P(\bar{X}_n) \nrightarrow \mathrm{Unif}(0,1),$$

for the $\bar{X}_n$'s surviving the file drawer effect (1).
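
The failure described in Lemma 1 can already be seen at a fixed sample size. The sketch below is a finite-n illustration, not a proof: it computes the exact conditional law of the pivot (2) for the translated Bernoulli distribution at n = 100, using the fact that $n\bar{X}_n = 2k - 1.5n$ with k binomial. The function name and the choice n = 100 are assumptions made here.

```python
import math

def normal_sf(x):
    """Survival function 1 - Phi(x) of N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def pivot_law_translated_bernoulli(n, mu=-0.5, threshold=2.0):
    """Exact conditional law of the pivot (2) given the selection event (1).

    Each X_i is -1.5 or 0.5 with probability 1/2, so n * Xbar = 2k - 1.5n
    with k ~ Binomial(n, 1/2). Returns (pivot value, conditional probability) pairs.
    """
    root_n = math.sqrt(n)
    denominator = normal_sf(threshold - root_n * mu)
    selected = []
    for k in range(n + 1):
        total = 2.0 * k - 1.5 * n           # n * Xbar
        if total > threshold * root_n:       # sqrt(n) * Xbar > threshold
            xbar = total / n
            pivot = normal_sf(root_n * (xbar - mu)) / denominator
            selected.append((pivot, math.comb(n, k)))
    mass = sum(weight for _, weight in selected)
    return [(pivot, weight / mass) for pivot, weight in selected]

law = pivot_law_translated_bernoulli(100)
largest_pivot = max(pivot for pivot, _ in law)
mass_above_half = sum(prob for pivot, prob in law if pivot > 0.5)
```

A Unif(0,1) variable exceeds 0.5 half of the time, whereas here every attainable pivot value lies below 0.5, so the Gaussian pivot is far from uniform on the selection event.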

Randomized selection circumvents this problem. In the following, we propose a 
randomized version of the “file drawer problem”. 

Example 2 (File drawer problem, randomized). We assume the same setup of a triangular
array of observations $X_{i,n}$ as in Example 1. But instead of reporting $\bar{X}_n$ when it
survives the file drawer effect (1), we independently draw $\omega \sim G$, and only report
$\bar{X}_n$ if

$$n^{1/2}\bar{X}_n + \omega > 2. \tag{3}$$

Note that the selection event is different from that in (1) in that we randomize the
sample mean before checking whether it passes the threshold. In this case, if
$F_n = N(\mu_n, 1)$, the survival function of $\bar{X}_n$ is

$$\begin{aligned}
P(t) &= P\left(\bar{X}_n > t \mid n^{1/2}\bar{X}_n + \omega > 2\right), && (\bar{X}_n, \omega) \sim N(\mu_n, 1/n) \times G \\
     &= P\left(Z > n^{1/2}(t - \mu_n) \mid Z + \omega > 2 - n^{1/2}\mu_n\right), && (Z, \omega) \sim N(0,1) \times G.
\end{aligned} \tag{4}$$

To compute the exact form of P(t), we have to compute the convolution of N(0,1)
and G, which has an explicit form for many distributions G. Moreover, when G is a Logistic
or Laplace distribution, we have

$$P(\bar{X}_{n,obs}) \to \mathrm{Unif}(0,1),$$

as long as $F_n$ has centered exponential moments in a fixed neighbourhood of 0. The
convergence is in fact uniform for $-\infty < \mu_n < \infty$. For details, see Lemma 10 in
Section 5.2.
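
The survival function (4) can be evaluated numerically when no closed form is at hand. The sketch below assumes Logistic randomization with scale 0.5 and uses a plain trapezoid rule on a truncated range; the helper names, the integration width and the step count are assumptions made for illustration.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def logistic_sf(t, scale):
    """Survival function of a centered Logistic(scale), written to avoid overflow."""
    x = t / scale
    if x > 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))

def integrate(f, lo, hi, steps=4000):
    """Composite trapezoid rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

def randomized_pivot(xbar_obs, n, mu, scale=0.5, threshold=2.0, width=12.0):
    """P(Z > a | Z + omega > b) as in (4), with a = sqrt(n)(xbar_obs - mu)
    and b = threshold - sqrt(n) * mu, for Z ~ N(0, 1) and omega ~ Logistic(scale)."""
    root_n = math.sqrt(n)
    a = root_n * (xbar_obs - mu)
    b = threshold - root_n * mu
    weight = lambda z: normal_pdf(z) * logistic_sf(b - z, scale)
    denominator = integrate(weight, -width, width)
    numerator = integrate(weight, a, width) if a < width else 0.0
    return numerator / denominator

# Negative-mean scenario: mu = -1, n = 100, reported mean -0.8.
p_negative_mean = randomized_pivot(-0.8, n=100, mu=-1.0)
```

Unlike in Example 1, a reported mean may lie below the threshold here, so the pivot is defined for any observed value.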

The only difference between these two examples is the randomization in selection.
After selection, we need to consider the conditional distribution for inference, which
conditions on the selection event. If we denote by $F_n^*$ the distribution used for selective
inference, we have in Example 1,

$$\frac{dF_n^*}{dF_n}(\bar{X}_n) = \frac{1\{n^{1/2}\bar{X}_n > 2\}}{P_{F_n}\left(n^{1/2}\bar{X}_n > 2\right)}. \tag{5}$$

We also call the ratio between $F_n^*$ and $F_n$ the selective likelihood ratio. In this case, the
selective likelihood ratio is simply a restriction to the $\bar{X}_n$'s that survive the file drawer
effect. We observe that

$$\sqrt{n}\bar{X}_n = \sqrt{n}\mu_n + Z, \qquad Z = \sqrt{n}(\bar{X}_n - \mu_n),$$

which leads to three scenarios for selection.

• $\mu_n > \delta > 0$, for some $\delta > 0$.

In this case, the dominant term for selection is $\sqrt{n}\mu_n$, and since we have a big
positive effect, we would always report the sample mean when n is big. This
corresponds to the selection event having probability tending to 1, and the selective
likelihood ratio goes to 1 as well. In this case, there is very little selection
bias, and the original law is a good approximation to the selective distribution
for valid inference.

• $\mu_n < -\delta < 0$, for some $\delta > 0$.

In this case, the dominant term is also $\sqrt{n}\mu_n$, but in the negative direction. As
$n \to \infty$, the selection probability vanishes and the selective likelihood ratio becomes
degenerate. We almost never report the sample mean in this scenario, but in the
rare event where we do, by no means can we use the original distribution for
inference.

• $-\delta < \sqrt{n}\mu_n < \delta$, for some $\delta > 0$.

This corresponds to local alternatives. In this case, the selective likelihood ratio neither
converges to 1 nor becomes degenerate. Rather, it becomes an indicator function
of a half interval. Proper adjustment is needed for valid inference in this case.

It is in the second scenario that the pivotal quantity (2) will not converge to Unif(0,1).
Different distributions have different behaviors in the tail. Since the conditioning
event $n^{1/2}\bar{X}_n > 2$ becomes a large-deviations event, we cannot expect the sample mean
to behave like the normal distribution in the tail.

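On the other hand, in Example 2, if we denote by $F_n^*$ the law for selective inference,
we have

$$\frac{dF_n^*}{dF_n}(\bar{X}_n)
  = \frac{\bar{G}\left(2 - n^{1/2}\bar{X}_n\right)}{E_{F_n}\left[\bar{G}\left(2 - n^{1/2}\bar{X}_n\right)\right]}
  = \frac{\bar{G}\left(2 - n^{1/2}(\bar{X}_n - \mu_n) - n^{1/2}\mu_n\right)}{E_{F_n}\left[\bar{G}\left(2 - n^{1/2}(\bar{X}_n - \mu_n) - n^{1/2}\mu_n\right)\right]}, \tag{6}$$

where $\bar{G}(t) = \int_t^\infty G(du)$ is the survival function of G. When $\mu_n < -\delta < 0$ for some
$\delta > 0$, and G is the Laplace or Logistic distribution so that $\bar{G}$ has an exponential tail,
the dominant term $\exp(n^{1/2}\mu_n)$ in both the numerator and the denominator will cancel
out, making the selective likelihood ratio properly behaved in this difficult scenario.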
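
The cancellation can be checked numerically. The sketch below evaluates the ratio in (6) under two simplifying assumptions made here: $n^{1/2}(\bar{X}_n - \mu_n)$ is treated as exactly N(0,1), and G is Logistic with scale 1. Under these assumptions the exponential tail gives the limit $\exp(z - 1/2)$ as $n^{1/2}\mu_n \to -\infty$; the function names are hypothetical.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def logistic_sf(t, scale):
    """Survival function of a centered Logistic(scale), written to avoid overflow."""
    x = t / scale
    if x > 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))

def integrate(f, lo, hi, steps=4000):
    """Composite trapezoid rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

def selective_lr(z, root_n_mu, scale=1.0, threshold=2.0):
    """Ratio (6) evaluated at sqrt(n)(Xbar - mu) = z, assuming Z ~ N(0, 1) exactly."""
    weight = lambda u: logistic_sf(threshold - u - root_n_mu, scale)
    normalizer = integrate(lambda u: normal_pdf(u) * weight(u), -12.0, 12.0)
    return weight(z) / normalizer

# The ratio at z = 1 stabilises as sqrt(n) * mu_n moves further into the negative tail.
ratios = [selective_lr(1.0, shift) for shift in (-5.0, -10.0, -20.0, -40.0)]
limit = math.exp(1.0 - 0.5)
```

With the non-randomized indicator in (5), by contrast, the ratio at any fixed z is zero once $n^{1/2}\mu_n$ is sufficiently negative.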

It turns out that this selective likelihood ratio is fundamental to formalizing asymptotic
properties of selective inference procedures. Its behavior determines not only the
asymptotic convergence of pivotal quantities like that in (4), but also whether consistent
estimation of the population parameters is possible with large samples.

Again in the negative mean scenario where $\mu_n < -\delta < 0$, the sample mean $\bar{X}_n$
surviving the non-randomized “file drawer effect” cannot be a consistent estimator for
the underlying means $\mu_n$ because it will always be positive. But if $\bar{X}_n$ is reported as in
Example 2, it will be consistent for $\mu_n$ even if $\mu_n$ is negative and bounded away from
0. For detailed discussion, see Section 3.

In general, the behavior of the selective likelihood ratio can be used to study the 
asymptotic properties of selective inference procedures. We study consistent estimation 
and weak convergence for selective inference procedures in Section 3 and Section 5 
respectively. 

We are especially inspired by the field of differential privacy (cf. Dwork et al.
(2014) and references therein) to study the use of randomization in selective inference.
Privatized algorithms purposely randomize reports from queries to a database in
order to allow valid interactive data analysis. To our understanding, our results are the
first results related to weak convergence in privatized algorithms, as most guarantees
provided in the differential privacy literature are consistency guarantees. Some other
asymptotic results in selective inference have also been considered in Tibshirani et al.
(2015), Tian & Taylor (2015), though these have a slightly different flavor in that they
marginalize over choices of models.

We conclude this section with some more examples. 

1.2. Linear regression 

Consider the linear regression framework with response $y \in \mathbb{R}^n$ and feature matrix
$X \in \mathbb{R}^{n \times p}$, with X fixed. We make a homoscedasticity assumption that
$\mathrm{Cov}[y \mid X] = \sigma^2 I$, with $\sigma^2$ considered known. Of interest is

$$\mu = E(y \mid X),$$

a functional of $F = \mathcal{L}(y \mid X)$, the conditional law of y given X. When F is a Gaussian
distribution, exact selective tests have been proposed for different selection procedures
(Tibshirani (1996), Taylor et al. (2014), Tian et al. (2015)). Removing the Gaussian
assumption on F, Tian & Taylor (2015) showed that the same tests are asymptotically
valid under some conditions.

Randomized selection in this setting is a natural extension of these works. Fithian
et al. (2014) proposed to use a subset of the data for model selection, which yields a
significant increase in power. In this work, we study general randomized selection procedures.
Consider the following example.

Due to the sparsity of the solution of the LASSO (Tibshirani (1996)),

$$\hat{\beta}_\lambda(y) = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \cdot \|\beta\|_1,$$

a small subset of variables can be chosen for which we want to report p-values or
confidence intervals. This problem has been studied in Lee et al. (2013a). However,
instead of using the original response y to select the variables, we can independently
draw $\omega \sim \mathbb{Q}$ and choose the variables using $y^* = y + \omega$. Specifically, we choose the
subset E by solving

$$\hat{\beta}_\lambda(y, \omega) = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y^* - X\beta\|_2^2 + \lambda \cdot \|\beta\|_1, \qquad y^* = y + \omega, \tag{7}$$

and take $E = \mathrm{supp}(\hat{\beta}_\lambda(y, \omega))$. In Section 4.2.2, we discuss how to carry out inference
after this selection procedure, with much increased power. We also discuss the reason
behind this increase in Section 4.2.
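
The selection step in (7) can be sketched in a few lines. The code below is a minimal illustration, not the implementation used in the paper: it solves the LASSO by cyclic coordinate descent and assumes Gaussian randomization with standard deviation `noise_sd`, a choice made here since the text leaves the randomization distribution general. All function names are hypothetical.

```python
import random

def soft_threshold(value, lam):
    """Soft-thresholding operator S(value, lam)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=200):
    """Minimise 0.5 * ||y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the partial residual (beta_j removed).
            rho = sum(X[i][j] * residual[i] for i in range(n)) + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

def randomized_lasso_support(X, y, lam, noise_sd, rng):
    """Select E = supp(beta_hat(y + omega)) with omega ~ N(0, noise_sd^2 I) (assumed)."""
    omega = [rng.gauss(0.0, noise_sd) for _ in y]
    y_star = [yi + wi for yi, wi in zip(y, omega)]
    beta = lasso_coordinate_descent(X, y_star, lam)
    return [j for j, b in enumerate(beta) if b != 0.0]

# Toy design: identity matrix, so the LASSO reduces to soft-thresholding y*.
X_toy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y_toy = [3.0, 0.1, -2.5]
support = randomized_lasso_support(X_toy, y_toy, lam=1.0, noise_sd=0.5, rng=random.Random(0))
```

Only the selection uses $y^*$; inference afterwards is still carried out on the original y under the selective distribution.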

1.3. Nonparametric selective inference

All the previous works on selective inference assume a parametric model like the Gaussian
family or the exponential family. In this work, we allow selective inference in a
nonparametric setting. Consider the following examples.

Suppose in a classification problem, we observe independent samples,

$$(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\},$$

with fixed p. This problem is nonparametric if we do not assume any parametric structure
for F and are simply interested in some population parameters of the distribution F.
In Section 5, we develop asymptotic theory to construct an asymptotically valid test
for the population parameters of interest. More details can be found in Section 5.4.1.

Also consider a multi-group problem where a response x is measured on p treatment
groups. A special case is the two-sample problem where there are two groups. It is of
interest to form a confidence interval for the effect size in the “best” treatment group.
This arises often in medical experiments where multiple treatments are performed and
we are interested in discovering whether one of the treatments has a positive effect. The fact
that we have chosen to report the “best” treatment effect exposes us to selection bias and
multiple testing issues (Benjamini & Hochberg (1995)), and therefore calls for adjustment
after selection. Benjamini & Stark (1996) have considered the parametric setting
where $X_j \sim N(\mu_j, \sigma^2)$ for each group. Suppose, for robustness, it is of interest to
report the median effect size instead of the mean (assuming responses are not symmetric).
Then without any assumptions on the distribution of the measurements, this also
becomes a nonparametric problem. We can apply the theory in Section 5 to cope
with this problem; for details, see Section 5.3.

1.4. Outline of the paper 

There are three main advantages of applying randomization for selective inference, 

• Consistent estimation under the selective distribution 

• Increase in power for selective tests 

• Weak convergence of selective inference procedures 

In the following sections, Section 2 gives the setup of selective inference and
introduces the selective likelihood ratio, which is the key to studying consistent estimation
and weak convergence of selective inference procedures. Section 4 focuses on linear
regression models with different randomization schemes, demonstrating the increase in
power. Section 5 proposes an asymptotic test for nonparametric settings. Theorem
9 proves that the central limit theorem holds under the selective distribution under mild
conditions; this is a result for fixed dimension p. Applications to the two examples in
Section 1.3 are discussed. Finally, Section 6 discusses the possibility of extending
our work to the setting where multiple selection procedures are performed on different
randomizations of the original data. One application is selective inference after cross
validation for the square-root LASSO (Belloni et al. (2011)).

2. Selective Likelihood Ratio 

We first review some key concepts of selective inference. Our data D lies in some
measurable space $(\mathcal{D}, \mathcal{F})$, with unknown sampling distribution $D \sim F$. Selective
inference seeks a reasonable probability model M, a subset of the probability measures
on $(\mathcal{D}, \mathcal{F})$, and carries out inference in M. Central to our discussion is a selection
algorithm, a set-valued map

$$\widehat{\mathcal{Q}} : \mathcal{D} \to \mathcal{Q}, \tag{8}$$

where $\mathcal{Q}$ is loosely defined as being made up of “potentially interesting statistical
questions”.

For instance, in the linear regression setting, $\mathcal{D} = \mathbb{R}^n$, our data D = y and we have
a fixed feature matrix $X \in \mathbb{R}^{n \times p}$. The unknown sampling distribution is
$F = \mathcal{L}(y \mid X)$, the conditional law of y given X.

A reasonable candidate for the range of $\widehat{\mathcal{Q}}$ might be all linear regression models
indexed by subsets of $\{1, \dots, p\}$ with known or unknown variance. For any selected
subset of variables E, we carry out selective inference within the model
$M = \{N(X_E\beta_E, \sigma^2 I), \beta_E \in \mathbb{R}^{|E|}\}$.

Since we use the data to choose the model M, it is only fair to consider the conditional
distribution for inference,

$$D \mid M \in \widehat{\mathcal{Q}}(D), \qquad D \sim F.$$

Therefore, we seek to control the selective Type-I error:

$$P_{M, H_0}\left(\text{reject } H_0 \mid M \in \widehat{\mathcal{Q}}\right) \le \alpha, \tag{9}$$

where M is the selected family of distributions in the range of $\widehat{\mathcal{Q}}$ and $H_0 \subset M$ is the
null hypothesis. Selective intervals for parametric models M can then be constructed
by inverting such selective hypothesis tests, though only the one-parameter case has
really been considered to date.

2.1. Randomized selection 

Randomized selection is a natural extension of the framework above. We enlarge our
probability space to include some element of randomization. Specifically, let $\mathcal{H}$ denote
an auxiliary probability space and $\mathbb{Q}$ a probability measure on $\mathcal{H}$. A randomized
selection algorithm is then simply

$$\widehat{\mathcal{Q}}^* : \mathcal{D} \times \mathcal{H} \to \mathcal{Q}.$$

Note the randomization is completely under the control of the data analyst and hence
$\mathbb{Q}$ will be fully known. This is an extension of the non-randomized selective inference
framework in the sense that we can take $\mathbb{Q}$ to be the Dirac measure at 0. Many choices
of $\widehat{\mathcal{Q}}^*$ are natural extensions of $\widehat{\mathcal{Q}}$, as we will see in many examples.

Randomized selective inference is simply based on the law $F^*$, which we also call
the selective distribution,

$$D \mid M \in \widehat{\mathcal{Q}}^*(D, \omega), \qquad (D, \omega) \sim F \times \mathbb{Q}. \tag{10}$$

Note that although randomization is incorporated into selection, inference is still carried
out using the original data D, after adjusting for the selection bias by considering
the conditional distribution $F^*$.

Similar to the selective inference we defined above, we seek to control the selective
Type-I error,

$$P_{F^*}(\text{reject } H_0) = P_{M, H_0}\left(\text{reject } H_0 \mid M \in \widehat{\mathcal{Q}}^*\right) \le \alpha. \tag{11}$$

Moreover, we also want to achieve good estimation, which makes

$$E_{F^*}\left((\hat{\theta}(D) - \theta(F))^2\right) \tag{12}$$

small.

In Sections 3 to 5, we will discuss concrete examples of $\mathcal{D}$, D, F and $\widehat{\mathcal{Q}}^*$. But
before that we first introduce the selective likelihood ratio, which is a crucial quantity
in studying the selective distribution $F^*$.


2.2. Selective likelihood ratio 


The selective likelihood ratio provides a way of connecting the original distribution F and
its selective counterpart $F^*$. It is easy to see from (10) that the selective distribution
is simply a restriction to the $(D, \omega)$'s such that model M will be selected. Thus $F^*$ is
absolutely continuous with respect to F, and the selective likelihood ratio is

$$\ell_F(D) = \frac{dF^*}{dF}(D) = \frac{W(M; D)}{E_F(W(M; D))}, \qquad
  W(M; D) = \mathbb{Q}\left(\{\omega : M \in \widehat{\mathcal{Q}}^*(D, \omega)\}\right), \qquad \forall F \in M. \tag{13}$$

The numerator in $\ell_F(D)$ is the restriction of $(D, \omega)$, integrated over the randomizations
$\omega$, and the denominator is simply a normalizing constant. One implication of the
selective likelihood ratio is that for distributions F in parametric families, their selective
counterparts may have the same parametric structure.
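
For the randomized file drawer problem of Example 2, the weight W in (13) is the probability, over $\omega$ alone, that the observed data are selected, which equals $\bar{G}(2 - n^{1/2}\bar{X}_n)$. The sketch below compares a Monte Carlo estimate with this closed form, assuming Logistic randomization with scale 0.5; the function names, the seed and the number of draws are choices made here.

```python
import math
import random

def logistic_sf(t, scale):
    """Survival function of a centered Logistic(scale), written to avoid overflow."""
    x = t / scale
    if x > 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))

def draw_logistic(rng, scale):
    """Inverse-CDF draw from a centered Logistic(scale)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return scale * math.log(u / (1.0 - u))

def selection_weight_mc(root_n_xbar, scale, rng, draws=50000, threshold=2.0):
    """Monte Carlo estimate of W(M; D) = Q({omega : sqrt(n) * Xbar + omega > threshold})."""
    hits = sum(
        1 for _ in range(draws)
        if root_n_xbar + draw_logistic(rng, scale) > threshold
    )
    return hits / draws

rng = random.Random(0)
w_monte_carlo = selection_weight_mc(1.5, 0.5, rng)
w_closed_form = logistic_sf(2.0 - 1.5, 0.5)  # G-bar(2 - sqrt(n) * Xbar)
```

Dividing W by its expectation under F gives the selective likelihood ratio; when the closed form is unavailable, the Monte Carlo estimate can stand in for the numerator.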


2.2.1. Exponential families

One commonly used parametric family is the exponential family. Assume that $F = F_\theta$
is an exponential family with natural parameter space $\Theta$, $\mathcal{D} = \mathbb{R}^n$ and the data
D = y. Its density with respect to the reference measure $dF_0$ is

$$\frac{dF_\theta}{dF_0}(y) = \exp\{\theta^T T(y) - \psi(\theta)\}, \qquad \theta \in \Theta. \tag{14}$$

Through the relationship in (13) we conclude that, for any randomization scheme, the
law $F_\theta^*$ is another exponential family. Formally,

Lemma 2. If $F_\theta$ belongs to the exponential family in (14), then for any randomized
selection procedure $\widehat{\mathcal{Q}}^*$, the selective distribution is also an exponential family,

$$\frac{dF_\theta^*}{dF_0}(y) \propto W(M; y) \exp\{\theta^T T(y) - \psi(\theta)\}, \qquad \theta \in \Theta,$$

with the same sufficient statistic T(y) and natural parameters $\theta$.

Furthermore, to test $H_{0j} : \theta_j = 0$, we consider the following law,

$$T_j(y) \mid T_{-j}(y). \tag{15}$$

The first claim of the lemma is quite straightforward using the relationship in (13).
The second claim is a Lehmann-Scheffé (cf. Chapter 4.4 in Lehmann (1986)) construction
which was proposed in Fithian et al. (2014), to construct tests for one of the
natural parameters treating the others as nuisance parameters. For detailed construction
of such tests in the linear regression setting, see Section 4.

3. Consistent Estimation After Model Selection 

In this section, we leave the parametric setup and consider general models M. In particular,
we study the consistency of estimators under the selective distribution for arbitrary
models. We first introduce the framework of asymptotic analysis under the selective
model. Then we state conditions for consistent estimation in Lemma 3 and conclude
with examples.

For any model M, which is a collection of distributions, we define its corresponding
selective model, which is the collection of corresponding selective distributions,

$$M^* = \left\{F^* : \frac{dF^*}{dF}(D) = \ell_F(D), \; F \in M\right\}, \tag{16}$$

where $\ell_F$ is the selective likelihood ratio for the selection event $\{M \in \widehat{\mathcal{Q}}^*\}$. Selective
inference is carried out under the selective model $M^*$.

In order to make meaningful asymptotic statements, we consider a sequence of randomized
selection procedures $(\widehat{\mathcal{Q}}_n^*)_{n \ge 1}$ and models $(M_n)_{n \ge 1}$ with each $M_n$ in the
range of $\widehat{\mathcal{Q}}_n^*$.

Often, we are interested in some population parameter which can be thought of
as a functional of the distribution $F_n \in M_n$,

$$\theta_n : M_n \to \mathbb{R}.$$

It is worth pointing out that $M_n$ is selected by $\widehat{\mathcal{Q}}_n^*$, which already incorporates the
statistical questions we are interested in. In this sense, $\theta_n$ is chosen a posteriori. The
selected model $M_n^*$ does not change our target of inference; it merely changes the
distribution under which such inference should be carried out. In other words, if $\theta_n$ is
the mean parameter, we are interested in the underlying mean of $F_n$, not $F_n^*$.

We might have a good estimator $\hat{\theta}_n : \mathcal{D}_n \to \mathbb{R}$ for $\theta_n(F_n)$ under $F_n$, namely

$$E_{F_n}\left[(\hat{\theta}_n - \theta_n(F_n))^2\right] \to 0.$$

$\hat{\theta}_n$ is a consistent estimator if our model $M_n$ is given a priori. But as we use the data
to select $M_n$, what we really care about is its performance under the selective distribution
$F_n^*$. Will this estimator still be consistent under the selective distribution $F_n^*$?

Formally, we say an estimator $\hat{\theta}_n$ is uniformly consistent in $L^p$ for $\theta_n(F_n)$ under the
sequence $(M_n)_{n \ge 1}$ if

$$\limsup_n \sup_{F_n \in M_n} \|\hat{\theta}_n - \theta_n(F_n)\|_{L^p(F_n)} = 0.$$

Similarly, we say that $\hat{\theta}_n$ is uniformly consistent in probability for the functional $\theta_n(F_n)$
under the sequence $(M_n)_{n \ge 1}$ if for every $\epsilon > 0$ there exists $\delta(\epsilon) > 0$ such that for all
$\delta > \delta(\epsilon)$

$$\limsup_n \sup_{F_n \in M_n} F_n\left(|\hat{\theta}_n - \theta_n(F_n)| > \delta\right) < \epsilon.$$

The following lemma states the conditions for consistency of $\hat{\theta}_n$ under the sequence
of corresponding selective models $(M_n^*)_{n \ge 1}$.

Lemma 3. Consider a sequence $(\widehat{\mathcal{Q}}_n^*, M_n)_{n \ge 1}$ of randomized selection procedures
and models. Suppose the selective likelihood ratios satisfy, for some p > 1,

$$\limsup_n \sup_{F_n \in M_n} \|\ell_{F_n}\|_{L^p(F_n)} < C. \tag{17}$$

Then any sequence of estimators $\hat{\theta}_n$ that is uniformly consistent for $\theta_n(F_n)$ in $L^\alpha$ is
also uniformly consistent for $\theta_n(F_n)$ in $L^\gamma$ under $(M_n^*)_{n \ge 1}$, where $\gamma \le \alpha / q$ and
$\frac{1}{p} + \frac{1}{q} = 1$.

Further, if $\hat{\theta}_n$ is uniformly consistent for $\theta_n$ in probability, then $\hat{\theta}_n$ is uniformly
consistent for $\theta_n$ in probability under the sequence $(M_n^*)_{n \ge 1}$.

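Proof. Let $\Delta_n = \hat{\theta}_n - \theta_n(F_n)$. To prove the first assertion, note that for any
$F_n^* \in M_n^*$, by Hölder's inequality,

$$\begin{aligned}
\|\Delta_n\|_{L^\gamma(F_n^*)}^\gamma &= \int_{\mathcal{D}_n} |\Delta_n|^\gamma \, \ell_{F_n}(y) \, F_n(dy) \\
 &\le \left\| |\Delta_n|^\gamma \right\|_{L^q(F_n)} \|\ell_{F_n}(y)\|_{L^p(F_n)} \\
 &= \|\Delta_n\|_{L^{\gamma q}(F_n)}^\gamma \|\ell_{F_n}(y)\|_{L^p(F_n)} \\
 &\le \|\Delta_n\|_{L^\alpha(F_n)}^\gamma \|\ell_{F_n}(y)\|_{L^p(F_n)} \\
 &\le C \|\Delta_n\|_{L^\alpha(F_n)}^\gamma.
\end{aligned}$$

For any $\delta > 0$,

$$\begin{aligned}
F_n^*(|\Delta_n| > \delta) &= \int_{\mathcal{D}_n} 1\{|\Delta_n| > \delta\} \, \ell_{F_n}(y) \, F_n(dy) \\
 &\le \left[F_n(|\Delta_n| > \delta)\right]^{1/q} \|\ell_{F_n}\|_{L^p(F_n)} \\
 &\le C \left[F_n(|\Delta_n| > \delta)\right]^{1/q}.
\end{aligned}$$

□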

We illustrate the application of Lemma 3 through our “file drawer effect” examples 
in Section 1.1. 


3.1. Revisit the “file drawer problem” 


First we note that in Examples 1 and 2, we observe data $D_n = (X_{1,n}, \dots, X_{n,n})$, with
$X_{i,n} \sim F_n$. The randomized selection in Example 2 can be realized as

$$\widehat{\mathcal{Q}}^*(D_n, \omega) = \begin{cases}
\text{report p-values for } \mu_n, & \text{if } \sqrt{n}\bar{X}_n + \omega > 2, \\
\text{do nothing}, & \text{if } \sqrt{n}\bar{X}_n + \omega \le 2,
\end{cases}$$

where we independently draw $\omega \sim G$.

If we always report $\bar{X}_n$, it is an unbiased estimator for $\mu_n$ and, by the law of large
numbers, a consistent one. However, since we only observe the sample means surviving
the file drawer effect, will $\bar{X}_n$ still be consistent for $\mu_n$?

In the most difficult scenario discussed in Section 1.1, where $\mu_n < -\delta < 0$ for some
$\delta > 0$, $\bar{X}_n$ cannot be a consistent estimator for $\mu_n$ in Example 1. This is easy to see as
Example 1 will only report positive sample means. A remarkable feature of randomized
selection is that consistent estimation of the population parameters is possible even
when the selection event has vanishing probabilities. In fact, the following lemma states
that when G is a Logistic distribution, $\bar{X}_n$ is consistent for $\mu_n$ after the randomized file
drawer effect in Example 2.

Lemma 4. Suppose, as in Example 2, we observe a triangular array with $X_{i,n} \sim F_n$, where
$F_n$ has mean $\mu_n = \mu < 0$. If we draw $\omega \sim \mathrm{Logistic}(\kappa)$, where $\kappa$ is the scale of
the Logistic distribution, then the sample means $\bar{X}_n$ surviving the “randomized” file
drawer effect are consistent for $\mu$,

$$\bar{X}_n \xrightarrow{p} \mu, \qquad \text{conditional on } \sqrt{n}\bar{X}_n + \omega > 2,$$

if $F_n$ has a moment generating function in a neighbourhood of 0. Namely, there exists
a > 0 such that

$$E_{F_n}\left[\exp\left(a |X_{i,n} - \mu_n|\right)\right] < C.$$

Before we prove the lemma, we want to point out that although the selection procedure
in Example 2 is different from that in Example 1 because of randomization, $\sqrt{n}\mu_n$
is still the dominant term in selection. Note that

$$\sqrt{n}\bar{X}_n + \omega = \sqrt{n}\mu_n + \sqrt{n}(\bar{X}_n - \mu_n) + \omega.$$

Since both $\sqrt{n}(\bar{X}_n - \mu_n)$ and $\omega$ are $O_p(1)$ random variables, the dominant term
$\sqrt{n}\mu_n \to -\infty$ ensures that the selection event has vanishing probabilities in
Example 2 as well. Thus it is particularly impressive that Example 2 gives consistent
estimation where Example 1 cannot. The proof of Lemma 4 is deferred to the appendix.

We also verified this theory of consistent estimation through simulations. Figure 1
shows the empirical distributions of the sample mean $\bar{X}_n$ after the file drawer effect
in Example 1 or the “randomized” file drawer effect in Example 2. They are marked
in blue and red respectively. We set the true underlying mean to
be $\mu_n = \mu = -1$ and mark it with the dotted vertical line in Figure 1. The upper
panel, Figure 1a, is simulated with n = 100 and the lower panel, Figure 1b, is simulated
with n = 250. We notice that in both simulations, the sample mean in Example 1
concentrates around the thresholding boundary, which is positive. Thus, these sample
means cannot possibly be consistent for the underlying mean $\mu = -1$. However, the existence of
randomization allows us to report negative sample means. As a result, the sample mean
in Example 2 will be consistent for $\mu = -1$. We see that as we increase the sample size n,
the sample means concentrate closer to $\mu = -1$.
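
The qualitative picture can be reproduced without simulation. The sketch below is not the code behind Figure 1: it assumes $F_n = N(\mu, 1)$ exactly, so that $\sqrt{n}(\bar{X}_n - \mu)$ is standard normal, and computes the mean of $\bar{X}_n$ under each selective distribution by numerical integration. Function names and integration limits are choices made here.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_sf(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def logistic_sf(t, scale):
    """Survival function of a centered Logistic(scale), written to avoid overflow."""
    x = t / scale
    if x > 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))

def integrate(f, lo, hi, steps=4000):
    """Composite trapezoid rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

def mean_after_file_drawer(n, mu, threshold=2.0):
    """E[Xbar | sqrt(n) * Xbar > threshold] for Xbar ~ N(mu, 1/n) (truncated normal mean)."""
    root_n = math.sqrt(n)
    c = threshold - root_n * mu
    return mu + normal_pdf(c) / normal_sf(c) / root_n

def mean_after_randomized_file_drawer(n, mu, scale=0.5, threshold=2.0):
    """E[Xbar | sqrt(n) * Xbar + omega > threshold] with omega ~ Logistic(scale)."""
    root_n = math.sqrt(n)
    c = threshold - root_n * mu
    weight = lambda z: normal_pdf(z) * logistic_sf(c - z, scale)
    numerator = integrate(lambda z: z * weight(z), -12.0, 16.0)
    denominator = integrate(weight, -12.0, 16.0)
    return mu + numerator / denominator / root_n

plain_100 = mean_after_file_drawer(100, -1.0)
plain_250 = mean_after_file_drawer(250, -1.0)
randomized_100 = mean_after_randomized_file_drawer(100, -1.0)
randomized_250 = mean_after_randomized_file_drawer(250, -1.0)
```

Under these assumptions the non-randomized selective means stay positive, just above the boundary $2/\sqrt{n}$, while the randomized ones are negative and move toward $\mu = -1$ as n grows.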

4. Inference in linear regression models 

In the linear regression setting, we assume a fixed feature matrix $X \in \mathbb{R}^{n \times p}$ and
observe the response vector $D = y \in \mathbb{R}^n$. We assume the noises are normally distributed.
There are two ways to parametrize a linear model, and both belong to some exponential
family. We introduce the selected model,

$$M_{sel}(E) = \left\{N(X_E\beta_E, \sigma^2 I) : \beta_E \in \mathbb{R}^{|E|}\right\}, \qquad E \subset \{1, \dots, p\}, \tag{18}$$

with $\sigma^2$ known or unknown, or the saturated model,

$$M_{sat} = \left\{N(\mu, \sigma^2 I) : \mu \in \mathbb{R}^n\right\}, \tag{19}$$

with known variance. Now we consider some randomized selection procedures and
inference after selection.

4.1. Data splitting and data carving 

In the introduction, we introduced data splitting (Cox (1975)) as a special case of
randomized selective inference. In Fithian et al. (2014), the term data carving was
introduced to demonstrate that data splitting is inadmissible. In data splitting (and data
carving) inference makes most sense in the selected model $M_{sel}(E)$, hence we should
think of $\widehat{\mathcal{Q}}$ as returning a subset E of variables selected.

[Figure 1: Empirical distributions of sample means in Example 1 and Example 2, with
original or randomized file drawer effect ("X after file drawer effect" and "X after
randomized file drawer effect"). Panels: (a) n = 100, (b) n = 250. For the randomization,
we draw $\omega \sim \mathrm{Logistic}(\kappa)$, with $\kappa = 0.5$.]

Let us formalize this notion in our notation. Let $\mathbb{Q}$ be some measure on assignments
of n data points into groups and $\widehat{\mathcal{Q}}$ a selection algorithm defined on datasets of any
size. The distribution $\mathbb{Q}$ determines a randomized selective inference procedure with
selection algorithm $\widehat{\mathcal{Q}}^*$, an algorithm applied to subsets of the original data set. In this
case, it is easy to see that

$$W(E; y) = W(M_{sel}(E); y) \propto \sum_{\omega} q_\omega \cdot 1\left\{M_{sel}(E) \in \widehat{\mathcal{Q}}(y_1(y, \omega))\right\},$$

where $q_\omega$ is the mass assigned to assignment $\omega$ by $\mathbb{Q}$. Multiple assignments or splits
considered in Meinshausen et al. (2009), Meinshausen & Bühlmann (2010) can be
formalized in a similar fashion. We can construct UMPU tests for $\beta_E$ in the selected
model $M_{sel}(E)$ by using Lemma 2 (also see Fithian et al. (2014)). We note that in
Fithian et al. (2014) the authors conditioned unnecessarily on the split $\omega$, and we would
expect that aggregating over splits would yield a more powerful procedure.
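
To make the aggregation over splits concrete, the sketch below enumerates all splits of a small data set and computes the fraction for which a selection rule fires on the first part. It assumes a uniform measure over splits of a fixed size and uses a toy selection rule (the mean of the first part exceeds 0.5); both are hypothetical choices made for illustration.

```python
from itertools import combinations

def split_selection_weight(y, select, first_size):
    """Fraction of equally likely splits whose first part makes `select` return True.

    This is W(E; y) up to normalisation when the measure over splits is uniform.
    """
    hits = 0
    total = 0
    for idx in combinations(range(len(y)), first_size):
        first_part = [y[i] for i in idx]
        if select(first_part):
            hits += 1
        total += 1
    return hits / total

def toy_rule(values):
    """Hypothetical selection rule: the mean of the selection part exceeds 0.5."""
    return sum(values) / len(values) > 0.5

y_small = [2.0, 2.0, 2.0, 0.0, 0.0, 0.0]
weight = split_selection_weight(y_small, toy_rule, first_size=3)
```

The enumeration grows combinatorially with n, which is the computational difficulty of aggregating over all random splits.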

However, there are two disadvantages with this randomization scheme. First, it is 
computationally difficult to aggregate over all random splits. Second, it seems difficult 
to consider the saturated model Msat for inference, which is more robust to model 
misspecifications. To overcome those difficulties, we introduce other randomization 
schemes below. 


4.2. Additive noise and more powerful tests 

Our second randomization scheme in linear regression involves additive noise. Specifically,
we draw $\omega \sim \mathbb{Q}$ and use the randomized response $y^*(y, \omega) = y + \omega$ for selection.
In this case, we can consider both the selected model $M_{sel}(E)$ and the saturated model
$M_{sat}$. Per Lemma 2, we can perform valid inference for $\beta_E$ in $M_{sel}(E)$ or linear
functionals of $\mu$ in $M_{sat}$.

One major advantage of using a randomized response $y^*$ for selective inference is
that these procedures yield much more powerful tests, at a small cost to the quality
of the selected models. In other words, a small amount of randomization causes a small
loss in the model selection stage, but we gain much more power in the inference stage.

The reason for increased power can be explained by a notion called leftover Fisher
information, first introduced in Fithian et al. (2014). Since selective inference is
essentially inference under the selective distribution $F^*$, the Fisher information under
$F^*$ determines how efficient the selective tests are. In the saturated model with
Gaussian noise $M_{sat}$, $y / \sigma^2$ is the score statistic and its variance under $F^*$ is exactly the
leftover Fisher information (a similar relationship holds in the selected model $M_{sel}(E)$).
Lemma 5 gives a lower bound on this leftover Fisher information when the randomization
noise is $\mathbb{Q} = N(0, \gamma^2 I)$.

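Lemma 5. For either $M_{sat}$ or $M_{sel}(E)$, if we use Gaussian randomization noise
$\mathbb{Q} = N(0, \gamma^2 I)$, and the selection is based on $\widehat{\mathcal{Q}}(y^*) = \widehat{\mathcal{Q}}(y + \omega)$, then the leftover
Fisher information is bounded below by

$$(1 - \tau) I(\theta), \qquad \tau = \sigma^2 / (\sigma^2 + \gamma^2),$$

where $I(\theta)$ is the non-selective Fisher information for $\theta$ in $M_{sat}$ or $M_{sel}(E)$. The
parameters $\theta$ depend on which of the two models we are considering.

Proof. In the saturated model the score statistic is $V = y / \sigma^2$. Since $\widehat{\mathcal{Q}}(y^*)$ is
measurable with respect to $y^*$,

$$\mathrm{Var}\left[V \mid \widehat{\mathcal{Q}}(y^*)\right] \ge \mathrm{Var}\left[V \mid y^*\right] = \frac{1}{\sigma^4} \mathrm{Var}\left[y \mid y^*\right].$$

Since y and $y^* = y + \omega$ are jointly normal with

$$\mathrm{Cov}[y, y^*] = \sigma^2 I, \qquad \mathrm{Var}[y^*] = (\gamma^2 + \sigma^2) I,$$

we have the leftover Fisher information

$$\mathrm{Var}\left[V \mid \widehat{\mathcal{Q}}(y^*)\right] \ge \frac{1}{\sigma^4} \mathrm{Var}\left[y \mid y^*\right]
  = \frac{1}{\sigma^4}\left(\sigma^2 I - \frac{\sigma^4}{\gamma^2 + \sigma^2} I\right)
  = \frac{1}{\sigma^2}(1 - \tau) I = (1 - \tau) I(\mu).$$

In the selected model $M_{sel}(E)$, the score statistic is $V = X_E^T (y - X_E\beta_E) / \sigma^2$. Similarly,

$$\mathrm{Var}\left[V \mid \widehat{\mathcal{Q}}(y^*)\right] \ge \frac{1}{\sigma^4} \mathrm{Var}\left[X_E^T y \mid y^*\right]
  = \frac{1}{\sigma^4}\left[\sigma^2 (X_E^T X_E) - \frac{\sigma^4}{\gamma^2 + \sigma^2} (X_E^T X_E)\right]
  = \frac{1}{\sigma^2}(1 - \tau) X_E^T X_E = (1 - \tau) I(\beta_E).$$

□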


When there is no randomization, $\gamma = 0$, we potentially have no leftover Fisher
information. This corresponds to a very rare selection event. However, after randomization,
even with very extreme selection, there is always leftover Fisher information, which
makes the selective tests more powerful. Consider the following examples.
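
As a small numerical illustration of the bound in Lemma 5, the sketch below evaluates the guaranteed fraction $1 - \tau$ of Fisher information for a few noise levels. The function name `leftover_fisher_fraction` is a hypothetical helper introduced here, not part of the original text.

```python
def leftover_fisher_fraction(sigma: float, gamma: float) -> float:
    """Return 1 - tau, where tau = sigma^2 / (sigma^2 + gamma^2) as in Lemma 5.

    sigma is the noise level of the response, gamma the standard deviation of
    the Gaussian randomization. The result is the fraction of the non-selective
    Fisher information that the lemma guarantees is left after selection.
    """
    if sigma <= 0 or gamma < 0:
        raise ValueError("need sigma > 0 and gamma >= 0")
    tau = sigma**2 / (sigma**2 + gamma**2)
    return 1.0 - tau


if __name__ == "__main__":
    # gamma = 0 (no randomization) gives a vacuous bound of 0.
    for gamma in (0.0, 0.5, 1.0, 2.0):
        print(gamma, leftover_fisher_fraction(1.0, gamma))
```

With $\gamma = \sigma$ the bound already guarantees half of the non-selective information, whatever the selection event.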


4.2.1. Revisit the “file drawer problem” 

In Example 1 and Example 2, if we assume $F_n = N(\mu, 1)$, they are a special case of
the linear regression model, with the feature matrix $X = \mathbf{1}$, the all-ones vector.

In this case, $n\bar{X}_n$ is the score statistic, and its variance under the selective
distribution is the Fisher information. Lemma 5 states that the leftover Fisher
information is lower bounded by $n(1 - \tau)$ if we randomize using Gaussian variables,
$\mathbb{Q} = N(0, \gamma^2)$, $\tau = 1/(1 + \gamma^2)$.

Moreover, the increase in leftover Fisher information with randomization is not specific
to Gaussian randomizations. For example, in Figure 1, when we use Logistic randomization,
we also observe that under the selective distribution with randomization,
$\bar{X}_n$ has a much bigger variance than without randomization. As discussed above, this
variance multiplied by $n^2$ is exactly the leftover Fisher information, which explains
why selective procedures after randomization perform better than those without.
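
The variance comparison can be checked numerically. The sketch below is an illustration under stated assumptions, not the computation behind Figure 1: it takes $\mu = 0$ and unit variance, so that $Z = \sqrt{n}\bar{X}_n \sim N(0,1)$, uses the selection rule $Z + \omega > 2$, and integrates the selective density of $Z$ on a grid. The helper names and the grid limits are choices made here.

```python
import math
import numpy as np


def normal_cdf(x):
    """Standard normal CDF evaluated elementwise."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(np.asarray(x) / math.sqrt(2.0)))


def selective_variance(selection_prob, lo=-12.0, hi=14.0, num=200001):
    """Variance of Z ~ N(0, 1) reweighted by P(selected | Z = z)."""
    z = np.linspace(lo, hi, num)
    weight = np.exp(-0.5 * z**2) * selection_prob(z)
    weight /= weight.sum()
    mean = np.sum(weight * z)
    return float(np.sum(weight * (z - mean) ** 2))


gamma, kappa = 1.0, 1.0  # noise scales, chosen for illustration
# No randomization: report only when Z > 2 (hard truncation).
var_none = selective_variance(lambda z: (z > 2.0).astype(float))
# Gaussian noise omega ~ N(0, gamma^2): P(Z + omega > 2 | Z = z).
var_gauss = selective_variance(lambda z: normal_cdf((z - 2.0) / gamma))
# Logistic noise with scale kappa: P(Z + omega > 2 | Z = z).
var_logis = selective_variance(lambda z: 1.0 / (1.0 + np.exp(-(z - 2.0) / kappa)))

if __name__ == "__main__":
    print("no randomization:", var_none)
    print("Gaussian noise:  ", var_gauss, ">= bound", gamma**2 / (1 + gamma**2))
    print("logistic noise:  ", var_logis)
```

On this scale the Lemma 5 bound reads $\mathrm{Var}(Z \mid \text{selected}) \ge 1 - \tau = \gamma^2/(1+\gamma^2)$ for Gaussian noise; the variance of $n\bar{X}_n$ is $n$ times the variance of $Z$.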

We investigate the relationship between the leftover Fisher information and the
length of confidence intervals constructed by inverting the pivot in (4). Specifically,
in Example 2, after observing a reported sample mean, we want to report confidence
intervals for the underlying mean $\mu$.

Fig 2: Selective confidence intervals for different added noise. (a) Gaussian added noise; (b) logistic added noise.

Figure 2 demonstrates the selective intervals (solid lines) after (3) with $\omega$ being
either Gaussian or Logistic noise. The sample size is $n = 100$. Unlike the nominal
confidence intervals (dashed lines), the selective intervals are valid with 90% coverage
for the underlying mean. Since Lemma 5 gives a lower bound of $(1 - \tau) I(\mu)$, and the
length of an interval scales inversely with the square root of the information, we
would intuitively expect the selective confidence intervals to be at most about
$1/\sqrt{1 - \tau}$ times the length of the nominal intervals. This is verified in Figure 2a,
when we observe very negative sample means. (The sample means can be negative because we
added randomization.) On the other hand, for Logistic randomization in Figure 2b, the
intervals are slightly wider than the nominal intervals around the truncation point
$2/\sqrt{n}$, but narrow to roughly the nominal size on both sides of the truncation point.
This indicates that added logistic noise might preserve more information than Gaussian
additive noise. Both additive noises improve significantly over a non-randomized scheme
(cf. Figure 3 in Fithian et al. (2014)).

Of course, the increase in power and shortening of selective confidence intervals 
does not come without a price. Because we select with a randomized response, we are 
likely to select a worse model. But the trade-off between model quality and power is 
highly in favor of randomization. See the following example. 


4.2.2. Linear regression with added noise 

Back to the general setup of linear regression models, we select a model by solving
LASSO with the randomized response $y^* = y + \omega$ and return the active set $E$ of the
solution (as in (7)). Then per Lemma 2, we can construct valid selective tests in both
$M_{sat}$ and $M_{sel}(E)$. For instance, in $M_{sel}(E)$, we can construct tests for the hypothesis
$H_{0j}: \beta_j = 0$, $j \in E$, based on the law

$$\mathcal{L}\left(\eta^T y \,\middle|\, A_E(y + \omega) \le b_E,\; P_{E\setminus j}\, y\right), \qquad (y, \omega) \sim N(X_E\beta_E, \sigma^2 I) \times \mathbb{Q}, \quad \beta_j = 0, \tag{20}$$

where $\eta = e_j$, $e_j$ is the $j$-th column of the identity matrix, $P_{E\setminus j}$ is the projection
matrix onto the column space of $X_E$ but orthogonal to $\eta$, and $A_E$, $b_E$ are the appropriate
matrix and vector corresponding to LASSO selection. This is a UMPU test due to the
Lehmann-Scheffé construction (Fithian et al. 2014) and controls the selective Type-I
error (11). Although we cannot compute the explicit form of (20), the selection events
in (20) are polyhedra, and thus a hit-and-run or Hamiltonian Monte Carlo algorithm
(Pakman & Paninski 2012) can be used for sampling.

Fig 3: Comparison of inference in additive noise randomization vs. data carving.
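
To make the hit-and-run idea concrete, here is a minimal sketch for a simplified target: a standard Gaussian vector restricted to a polyhedron $\{x : Ax \le b\}$. It is not the sampler for the full law (20), which also involves the randomization $\omega$ and the conditioning on $P_{E\setminus j}\,y$. The function names, the toy constraint matrix and the starting point are all assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for CDF inversion


def sample_truncated_normal(mean, lo, hi, rng):
    """Draw from N(mean, 1) restricted to [lo, hi] by inverting the CDF."""
    p_lo = _STD.cdf(lo - mean)
    p_hi = _STD.cdf(hi - mean)
    u = rng.uniform(p_lo, p_hi)
    u = min(max(u, 1e-15), 1.0 - 1e-15)  # keep inv_cdf away from 0 and 1
    value = mean + _STD.inv_cdf(u)
    return min(max(value, lo), hi)


def hit_and_run(A, b, x0, n_steps, rng):
    """Hit-and-run chain targeting N(0, I) restricted to {x : A x <= b}."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.array(x0, dtype=float)
    if np.any(A @ x > b):
        raise ValueError("x0 must satisfy A x0 <= b")
    samples = np.empty((n_steps, x.size))
    for i in range(n_steps):
        # Random direction on the unit sphere.
        d = rng.standard_normal(x.size)
        d /= np.linalg.norm(d)
        # Feasible step sizes t: A (x + t d) <= b  <=>  t * (A d) <= b - A x.
        slope = A @ d
        slack = b - A @ x
        lo, hi = -np.inf, np.inf
        pos, neg = slope > 1e-12, slope < -1e-12
        if pos.any():
            hi = float(np.min(slack[pos] / slope[pos]))
        if neg.any():
            lo = float(np.max(slack[neg] / slope[neg]))
        # Along the line, the N(0, I) density is proportional to
        # exp(-(t + d.x)^2 / 2), i.e. t ~ N(-d.x, 1) truncated to [lo, hi].
        t = sample_truncated_normal(-float(d @ x), lo, hi, rng)
        x = x + t * d
        samples[i] = x
    return samples


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy polyhedron: x1 >= 0, x2 >= 0, x1 + x2 <= 3.
    A = [[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
    b = [0.0, 0.0, 3.0]
    draws = hit_and_run(A, b, x0=[1.0, 1.0], n_steps=2000, rng=rng)
    print(draws.mean(axis=0))
```

Each step only needs the feasible interval along a random line, which is why polyhedral selection events are convenient: the interval comes from a handful of ratios rather than a general constrained search.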

Figure 3 compares inference in the additive Gaussian noise scheme to the data carving
procedure proposed in Fithian et al. (2014) as well as data splitting. In $M_{sel}(E)$,
the probability of screening (i.e. selecting $E$ including all the nonzero $\beta$'s) is a
surrogate for the quality of the model. As additive noise uses a different randomization
scheme than data splitting and data carving, we vary the amount of randomization used
in each scheme and match on the probability of screening. Thus Figure 3 is like an ROC
curve for the trade-off between model quality and power of tests. The x-axis goes in
the direction of increased randomization, with the leftmost point corresponding to no
randomization at all. We see that even a small randomization that barely affects model
selection can substantially lower the Type-II error, from 0.2 to less than 0.05. The
trade-off is highly in favor of (small) randomization. We see in Figure 3 that additive
noise lowers the Type-II error to almost half that of data carving for the same screening
probability, and both clearly dominate data splitting. For the concrete setup of the
simulation, see Chapter 7 of Fithian et al. (2014).

5. Weak convergence and selective inference for statistical functionals 

In the nonparametric setting, we assume a triangular array of data, $D_n = (d_{1,n}, \dots, d_{n,n})$,
and $d_{i,n} \overset{iid}{\sim} F_n$. When $F_n = F$, it is the special case of independent sampling. We are
interested in some functional of the distribution, $\mu_n = \mu(F_n)$. Associated with $\mu_n$ is
our statistic $T$, which is a linearizable statistic (Chung et al. 2013).

Definition 6 (Linearizable statistic). Suppose $d_{i,n} \overset{iid}{\sim} F_n$. We call $T$ a linearizable
statistic for $\mu_n = \mu(F_n)$ if, for any sample size $n$,

$$T(D_n) = \frac{1}{n}\sum_{i=1}^{n} \xi_{i,n} + R, \qquad \xi_{i,n} = \xi(d_{i,n}),$$

$$E[\xi_{i,n}] = \mu_n \in \mathbb{R}^p, \qquad \mathrm{Cov}[\xi_{i,n}] = \Sigma_n \in \mathbb{R}^{p \times p},$$

where $\xi$ is a function of the data and $R$ is bounded in probability, $R = O_p(n^{-1})$
under $F_n$. With a slight abuse of notation, we use $\xi_{i,n}$ to denote the iid random
variables as well.

Throughout this section, we assume the dimension $p$ is fixed. We are interested in
establishing a pivotal quantity for $T_n = T(D_n)$ like (4) in Example 2, where $T_n$ is the
sample mean after the randomized “file drawer effect”. It turns out we have an exact
pivotal quantity if $T_n$ is normally distributed. To lighten notation, we suppress the
subscript $n$ in the following lemma, which is a finite sample result valid for any $n$. We
prove the lemma in Section 7.

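Lemma 7. Suppose the statistic $T$ is normally distributed, $T \sim N(\mu, \Sigma/n)$, and the model $M$
is selected by the randomized selection $\widehat{Q}^*(T, \omega)$, where $\omega \sim \mathbb{Q}$. Then for any contrast $\eta$,
which could depend on the outcome of selection $\widehat{Q}^*$, we have, conditional on the selection of $M$,

$$P(T; \eta^T\mu, \Sigma) = \frac{\int_{\eta^T T}^{\infty} \mathcal{Q}(t; V_\eta)\,\exp\left(-n(t - \eta^T\mu)^2/2\sigma_\eta^2\right) dt}{\int_{-\infty}^{\infty} \mathcal{Q}(t; V_\eta)\,\exp\left(-n(t - \eta^T\mu)^2/2\sigma_\eta^2\right) dt} \sim \mathrm{Unif}(0,1), \tag{22}$$

where

$$\mathcal{Q}(t; V_\eta) = \mathbb{Q}\left(\left\{\omega : M \in \widehat{Q}^*\left(t\,\Sigma\eta/\sigma_\eta^2 + V_\eta,\; \omega\right)\right\}\right),$$

with $\sigma_\eta^2 = \eta^T\Sigma\eta$ and $V_\eta = T - \Sigma\eta\,\eta^T T/\sigma_\eta^2$.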

Remark 8. In selected models $M_{sel,E}$, the selection is often made not only based on
$(T, \omega)$, but also on other statistics of the data, which we call the null statistic $N$. Thus
the selection event should be expressed as $\{M \in \widehat{Q}^*((T, N), \omega)\}$. To make notation
simpler, we exclude such possibilities. But a slightly modified pivot, where we replace
$\mathcal{Q}(t; V_\eta)$ with $\mathcal{Q}(t; V_\eta, N)$ in (22) and integrate over $N$, is still $\mathrm{Unif}(0,1)$ distributed.

Note that Lemma 7 provides a valid pivotal quantity for any randomized selection
procedure $\widehat{Q}^*$ and any randomization noise $\mathbb{Q}$, provided that $T$ is normally distributed.
In fact, Lemma 7 does not require $T$ to be a linearizable statistic. In some sense, the
lemma is a reformulation (after rescaling) of the selective tests constructed in the linear
regression model with additive noise (see Section 4.2.2). For example, in the selected
model $M_{sel,E}$, to test the hypothesis $H_{0j}: \beta_j = 0$, $j \in E$, we consider the law
(20). After introducing the null statistic $N = P_E^{\perp} y$, the pivot in (22) is in fact the
CDF transform of this law, taking $T = P_E y$, $\Sigma = n\sigma^2 P_E$, and the selection event
$\{M \in \widehat{Q}^*((T, N), \omega)\}$ to be the affine selection event defined in (20). With a simple
calculation, it is easy to see that $V_\eta = (P_E - \eta\eta^T/\|\eta\|^2)\, y = P_{E\setminus j}\, y$, which we condition
on in both (22) and (20).

Of course the pivot in (22) is very difficult to compute explicitly, and we need to
use sampling schemes like the one in (20). But in a nutshell, $P(T; \eta^T\mu, \Sigma)$ is simply a CDF
transform of the law

$$\mathcal{L}\left(\eta^T T \,\middle|\, V_\eta,\; M \in \widehat{Q}^*(T, \omega)\right), \qquad (T, \omega) \sim N(\mu, \Sigma/n) \times \mathbb{Q}. \tag{23}$$

After introducing the null statistic, Lemma 7 is agnostic to the choice between the selected
model $M_{sel,E}$, where $\mu = X_E\beta_E$, and the saturated model $M_{sat}$, where the parameter is
simply $\mu$. The difference between the two models in terms of sampling is that the saturated
model conditions on $N$ (treating it as part of $V_\eta$), whereas the selected model integrates
over $N$.

Lemma 7 is written with $T$ implicitly being the approximate average of $n$ i.i.d. variables,
hence the distribution $N(\mu, \Sigma/n)$. Linearizable statistics are of particular interest as
they converge to $N(\mu, \Sigma/n)$ due to the central limit theorem. In the following, we seek to
establish conditions under which the pivot $P(T; \eta^T\mu, \Sigma)$ will be asymptotically $\mathrm{Unif}(0,1)$.


5.1. Selective central limit theorem 


In other work on asymptotics of selective inference (Tian & Taylor 2015, Tibshirani
et al. 2015), the setup considered is usually the saturated model $M_{sat}$. These works
considered asymptotics of selective inference marginalized over the range of $\widehat{Q}^*$. In
contrast, we consider the convergence for any particular selected model $M_n$, under
the conditional law of the selection event $\{M_n \in \widehat{Q}^*_n\}$. Specifically, we allow weak
convergence of the pivot in (22) in the sequence of selected models $(M_n)_{n \ge 1}$. As
explained above, selected models integrate over the null statistics while saturated models
condition on those, thus the selective tests should have more power provided that the
selected model is believable. In the saturated model, our result provides a finer measure
of convergence than in Tian & Taylor (2015). On the other hand, Tian & Taylor (2015)
allow a high-dimensional setting in some cases while we consider fixed dimension $p$.

Similar to the asymptotic setting in Section 3, we consider the convergence of
$P(T_n; \eta_n^T\mu_n, \Sigma_n)$ under a sequence of models $(M_n)_{n \ge 1}$ selected by a sequence of
selection procedures $(\widehat{Q}^*_n)_{n \ge 1}$. $(T_n)_{n \ge 1}$ is a sequence of linearizable statistics defined
in Definition 6, with asymptotic mean $\mu_n$ and asymptotic covariance matrix $\Sigma_n/n$.

It turns out that in this setting, the selective likelihood ratio again plays an
important role in the convergence of the pivot. Recall that with randomized selection
$\widehat{Q}^*_n(T_n, \omega)$, the selective likelihood is

$$\ell_{F_n}(T_n; M_n) = \frac{W(T_n; M_n)}{E_{F_n}[W(T_n; M_n)]}, \qquad W(T_n; M_n) = \mathbb{Q}\left(\{\omega : M_n \in \widehat{Q}^*_n(T_n, \omega)\}\right). \tag{24}$$

It will be convenient to rewrite the likelihood ratio in terms of the normalized vector
$Z_n = \sqrt{n}(T_n - \mu_n)$,

$$\bar{\ell}_{F_n}(Z_n) = \ell_{F_n}(n^{-1/2} Z_n + \mu_n), \tag{25}$$

as well as the pivot (22),

$$\bar{P}_{F_n}(Z_n) = P(n^{-1/2} Z_n + \mu_n;\; \eta_n^T\mu_n, \Sigma_n). \tag{26}$$

Our approach is basically a comparison of how the pivot will behave under $F_n$ and
its Gaussian counterpart $\Phi_n = N(\mu(F_n), \Sigma(F_n))$. Specifically, it is a modification
of the proof of Theorem 1.1 of Chatterjee (2005), adapted to allow for the fact that the
derivatives of the pivot and the likelihood are not required to be uniformly bounded.
Given a norm $h$ on $\mathbb{R}^p$, define

$$\lambda_r^h(f) = \sup\left\{\|\partial^k f(s)\|^{r/k} \exp(-r\,h(s)) : s \in \mathbb{R}^p,\; 1 \le k \le r\right\}, \tag{27}$$

where $\partial^k$ denotes the $k$-fold differentiation with respect to the $p$-dimensional vector $s$, and
$\|\cdot\|$ denotes the element-wise maximum.

Now we state our selective central limit theorem, which we prove in Section 7. 

Theorem 9 (Selective central limit theorem). Suppose the statistics $T_n = T(D_n)$ are
linearizable statistics according to Definition 6. We also assume the norms $h : \mathbb{R}^p \to \mathbb{R}$
are such that each $f \in \{\bar{P}_{F_n}, \bar{\ell}_{F_n}\}$ satisfies

$$\sup_{F_n \in M_n} \lambda_3^h(f) \le C_1. \tag{28}$$

Moreover, assume $\xi_{i,n} - \mu(F_n)$ has a uniformly bounded moment generating function in some
neighbourhood of 0. Namely, there exists $a > 0$ such that

$$\sup_{n \ge 1}\, \sup_{F_n \in M_n} E_{F_n}\left(\exp(a\,\|\xi_{i,n} - \mu(F_n)\|_1)\right) \le C_2. \tag{29}$$

Furthermore, we assume

$$\limsup_n\; n^{1/2} \cdot \frac{\left|P_{(F_n \times \mathbb{Q})}[M_n \in \widehat{Q}^*_n] - P_{(\Phi_n \times \mathbb{Q})}[M_n \in \widehat{Q}^*_n]\right|}{P_{(\Phi_n \times \mathbb{Q})}[M_n \in \widehat{Q}^*_n]} \le C_3. \tag{30}$$

Then, for any $g$ with uniformly bounded derivatives up to third order,

$$\left|E_{F_n^*}[g(P(T_n))] - \int_0^1 g(x)\,dx\right| \le n^{-1/2} K(g, C_1, C_2, C_3, p), \qquad n \ge n_0, \tag{31}$$

where $K$ depends only on the bounds on the derivatives of $g$, the constants $C_1, C_2, C_3$
and the dimension $p$. Thus the convergence is uniform in $(M_n)_{n \ge 1}$ for models satisfying
(28), (29) and (30).

Theorem 9 provides a finite sample bound on the convergence of the pivot $P(T_n)$.
Since we allow $g$ to be any function with uniformly bounded derivatives up to the third
order, (31) implies convergence of $P(T_n)$ to $\mathrm{Unif}(0,1)$ under $F^*$. In the following
examples, we show how to verify conditions (28), (29) and (30).


5.2. Revisit the “file drawer problem” 

In Examples 1 and 2, we considered only reporting an interval or a p-value about $\mu_n$
when $n^{1/2}\bar{X}_n > 2$ or $n^{1/2}\bar{X}_n + \omega > 2$. This is an example where we do not really
select a model, but rather select only a proportion of the data to report. The selective
distribution simply refers to the law of the reported sample means, which pass the
threshold.

The data we observe is $D_n = (X_{1,n}, \dots, X_{n,n})$, with the linearizable statistic $T_n$
simply being the sample mean $\bar{X}_n$. Example 1 corresponds to the degenerate
randomization of adding 0 to $n^{1/2}\bar{X}_n$. The work of Tian & Taylor (2015) shows that in order
for the corresponding pivot to converge weakly we can take, for $\Delta < 0$ fixed,

$$M_n = \left\{F : \mu(F) \ge n^{-1/2}\Delta,\; E_F[|X|^3] < \infty\right\}. \tag{32}$$

That is, $\bar{X}_n$ will satisfy a selective CLT when the population mean is not too negative.
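On the other hand, in Example 2, the pivot in (22) is of the form

$$P(\bar{X}_n) = \frac{\int_{\bar{X}_n}^{\infty} \bar{G}(2 - \sqrt{n}\,z)\,\exp\left(-n(z - \mu_n)^2/2\sigma^2\right) dz}{\int_{-\infty}^{\infty} \bar{G}(2 - \sqrt{n}\,z)\,\exp\left(-n(z - \mu_n)^2/2\sigma^2\right) dz}, \tag{33}$$

where $\bar{G} = 1 - G$ is the survival function of the randomization noise and $\sigma^2$ is the
variance of $X_{i,n}$, and the likelihood $\ell_{F_n}(\bar{X}_n)$ is defined in (6).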

When $G$ is the Logistic noise, conditions (28) and (30) can be verified. Formally,
we have the following lemma, whose proof we defer to the appendix.

Lemma 10. If $G = \mathrm{Logistic}(\kappa)$, with $\kappa$ being the scale parameter, and the centered
$X_{i,n}$'s have moment generating functions in a neighbourhood of zero, then the pivot
$P(\bar{X}_n)$ is asymptotically $\mathrm{Unif}(0,1)$.

In other words, with Logistic randomization noise, we can take the sequence of
models to be

$$M_n = \left\{F_n : E_{F_n}\left[\exp(a\,|X_{i,n} - \mu_n|)\right] < \infty\right\}, \quad \text{for some } a > 0. \tag{34}$$

Requiring exponential moments is stricter than the third moment condition in (32), but
we obtain a stronger conclusion, namely weak convergence uniformly over all
$\mu_n \in \mathbb{R}$.
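
A numerical sketch of such a pivot may help fix ideas. The code below is an illustration under assumptions made here, not the paper's implementation: it uses the upper-tail convention, a Gaussian approximation $N(\mu, \sigma^2/n)$ for the sample mean with known $\sigma$, the selection rule $\sqrt{n}\bar{X}_n + \omega > 2$ with $\omega \sim \mathrm{Logistic}(\kappa)$, and integration on a grid. The function name `file_drawer_pivot` and its default arguments are hypothetical.

```python
import numpy as np


def file_drawer_pivot(xbar, mu, n, sigma=1.0, kappa=1.0, num=200001):
    """Upper-tail selective pivot for the mean in the randomized file drawer.

    The sample mean is treated as N(mu, sigma^2 / n) and is reported only when
    sqrt(n) * xbar + omega > 2, with omega ~ Logistic(scale=kappa).
    """
    u = np.linspace(-12.0, 12.0, num)        # standardized grid
    z = mu + sigma * u / np.sqrt(n)          # candidate values of the sample mean
    # P(omega > 2 - sqrt(n) z) for logistic noise, with the exponent clipped
    # to avoid overflow warnings.
    expo = np.clip(-(np.sqrt(n) * z - 2.0) / kappa, -700.0, 700.0)
    select = 1.0 / (1.0 + np.exp(expo))
    weight = select * np.exp(-0.5 * u**2)    # selection weight times Gaussian kernel
    return float(weight[z >= xbar].sum() / weight.sum())


if __name__ == "__main__":
    # Pivot evaluated at the null value mu = 0 for a reported mean of 0.25.
    print(file_drawer_pivot(xbar=0.25, mu=0.0, n=100))
```

Inverting this function in `mu` over a grid of candidate means would give a selective confidence interval, in the spirit of the intervals shown in Figure 2.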


5.3. Two-sample median problem 


In the two-sample median problem, we have two treatment groups from which we take
measurements, $x_{1i} \sim F_1$ and $x_{2i} \sim F_2$; for simplicity of notation, we assume we
observe $n$ samples from each group, and drop $n$ in the subscript. In the non-randomized
setting, we report the bigger of the two group medians. The exact formulation of
randomized selection will be discussed below.

Suppose our underlying distribution is $F = F_1 \times F_2$. Let $\mu = (\mu_1, \mu_2)$ be the population
medians of the two groups, and $T = (T_1, T_2)$ the sample medians. The well-known
result by Bahadur (1966) states that the sample median is a linearizable statistic for the
median when the CDF of the distribution $F$ has positive density $f$, and $f'$ is bounded
in a neighbourhood of the population median $m$. Formally, if $x_i \overset{iid}{\sim} F$, then the sample
median satisfies

$$T(x_1, \dots, x_n) = m + \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbb{1}\{x_i > m\} - 1/2}{f(m)} + R_n, \tag{35}$$

with $R_n = O(n^{-3/4}\log n)$ with probability 1.

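Our (randomized) selection algorithm $\widehat{Q}^*$ reports

$$\begin{cases} P(T; \mu_1, \Sigma), & \text{if } T_1 > T_2 + n^{-1/2}\omega, \\ P(T; \mu_2, \Sigma), & \text{if } T_1 \le T_2 + n^{-1/2}\omega, \end{cases}$$

where $\omega \sim \mathbb{Q}$ and $\Sigma = \mathrm{diag}\left(\tfrac{1}{4} f_1(\mu_1)^{-2},\; \tfrac{1}{4} f_2(\mu_2)^{-2}\right)$ is a diagonal matrix, and $f_1$, $f_2$ are
the densities of $F_1$ and $F_2$. Without loss of generality, we suppose $\mu_1$ is selected, i.e.
the first group is the “best” group.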

We choose the randomization noise $\mathbb{Q}$ to be $\mathrm{Logistic}(\kappa)$ with mean 0, where $\kappa$ is the
scale, and let $G_\kappa$ be its CDF. The resulting pivot for $\mu_1$ is

$$P(T_1; \mu_1, \sigma_1^2) = \frac{\int_{T_1}^{\infty} G_\kappa(\sqrt{n}\,t - \sqrt{n}\,T_2)\,\exp\left(-n(t - \mu_1)^2/2\sigma_1^2\right) dt}{\int_{-\infty}^{\infty} G_\kappa(\sqrt{n}\,t - \sqrt{n}\,T_2)\,\exp\left(-n(t - \mu_1)^2/2\sigma_1^2\right) dt}, \qquad \sigma_1^2 = \frac{1}{4 f_1(\mu_1)^2}.$$

This pivot resembles the pivot in (33) for Example 2, with the truncation
threshold 2 replaced by $\sqrt{n}\,T_2$ and with the appropriate means and variances of the
medians plugged in. A result similar to Lemma 10 can be established, which ensures
convergence of the pivot uniformly for any underlying medians $(\mu_1, \mu_2)$.

In order to construct the above pivot, we need knowledge of the variance $\sigma_1^2$. Without
selection, there are natural estimates of this variance. One may ask, how will inference
be affected if we plug this estimate into our pivot? We revisit this question in Section
5.5.

5.4. Affine selection events 

In this section, we discuss the special case of affine selection events (regions). This 
combined with the asymptotic result in Theorem 9 applies to more general settings. In 
particular, it allows us to approximate non-affine regions. For a concrete example, see 
Section 5.4.1. 

We drop the subscript $n$ where possible to simplify notation. Suppose for our model
$M$, the selection is based on $(T, \omega)$, and the selection event $\{M \in \widehat{Q}^*\}$ can be
described as

$$\{\sqrt{n}\,A_M T + \omega \in K_M\},$$

where the affine matrix $A_M \in \mathbb{R}^{d \times p}$ and $K_M$ is a region in $\mathbb{R}^d$. Many examples of
non-randomized selective inference can be expressed in this way (cf. Lee et al. (2013b),
Taylor et al. (2014), Lee & Taylor (2014), Fithian et al. (2015)). In this section, we
provide conditions under which Theorem 9 can be applied.

We again normalize $T$ to be $Z = \sqrt{n}(T - \mu)$; then the selection event can be
rewritten as

$$\{A_M(Z + \Delta) + \omega \in K_M\}, \tag{36}$$

where $\sqrt{n}\,\mu = \Delta$ and $Z$ converges to $N(0, \Sigma)$.

Suppose $\omega \sim \mathbb{Q}$, which has distribution function $G$. Then we introduce some
conditions on the selection region $K_M$ and the added noise distribution $G$.



Tian and Taylor/Selective inference with a randomized response 


Lower bound: We assume there is some norm $h$ such that

$$\int_{K_M - b} G(dw) \ge C \exp\left(-\inf_{w \in K_M - b} h(w)\right), \qquad \forall\, b \in \mathbb{R}^d.$$

Smoothness: Suppose $G$ has density $g$. We assume the first 3 derivatives of $g$ are
integrable,

$$\int \|\partial^j g(w)\|\, dw \le C_j, \qquad j = 0, 1, 2, 3,$$

where the norm on the left-hand side is the element-wise maximum of the partial
derivatives.

The above two conditions essentially require $G$ to be differentiable and to have tails
at least as heavy as exponential tails. In fact, we prove that the lower bound and
smoothness conditions ensure that (28) is satisfied under the local alternatives
introduced below.

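Definition 11 (Local alternatives). For the sequence of selected models $(M_n)_{n \ge 1}$, we
define the local alternatives of radius $B$ to be the set of all sequences $(\mu_n)_{n \ge 1}$ such
that

$$d_h\left(0,\; K_{M_n} - A_{M_n}\Delta_n\right) \le B, \qquad \Delta_n = \sqrt{n}\,\mu_n,$$

where $d_h(\cdot, \cdot)$ is the distance induced by the norm $h$.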

The notion of local alternatives is natural in the asymptotic setting as we expect even 
a small effect size will be more prominent when we collect more and more data. 

Formally, we have the following lemma, whose proof is deferred to the appendix. 

Lemma 12. Suppose $G$, $K_M$ satisfy the lower bound and smoothness conditions. Then
condition (28) is satisfied under the local alternatives.

Now, we are left to verify conditions (29) and (30). Condition (29) is essentially
a moment condition on the centered statistics $\xi_{i,n} - \mu(F_n)$, which we have to assume.

Condition (30) can be verified using well-known results on the multivariate CLT (see
Götze (1991)). To be rigorous, we state the following lemma, which we also prove in
the appendix.

Lemma 13. If $F_n$ is such that the centered statistics $\xi_{i,n} - \mu(F_n)$ have finite third
moments, then under the local alternatives, condition (30) is satisfied.

To summarize, Lemma 12 and Lemma 13 state that if $G$ has integrable derivatives
and exponential tails, then the pivot in (22) converges to $\mathrm{Unif}(0,1)$ uniformly for $F^*$
so long as the $F_n$'s are such that $\xi_{i,n} - \mu(F_n)$ have exponential moments in a
neighbourhood of 0.

Unlike the sample mean and sample median examples, the pivot is difficult to compute
explicitly in this case. However, as we discuss in the beginning of Section 5, the
pivot is essentially the CDF transform of the conditional law (23), which we can sample
from. As discussed above, we can just take $\omega$ to be from a Logistic distribution.

Now we apply the above theory to logistic regression.

5.4.1. Example: randomized logistic lasso

Suppose we observe independent samples, $d_i = (y_i, x_i) \overset{iid}{\sim} F$, where the $y_i$'s are binary
observations and $x_i \in \mathbb{R}^p$. The ordinary logistic regression solves the following
problem,

$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \ell(\beta) = \arg\min_{\beta \in \mathbb{R}^p} -\sum_{i=1}^{n} \left[y_i \log \pi(x_i^T\beta) + (1 - y_i)\log(1 - \pi(x_i^T\beta))\right], \tag{37}$$

where $\pi(x) = \exp(x)/(1 + \exp(x))$. This is a nonparametric setting as we do not
assume any parametric structure for $F$.

The randomized logistic lasso adds an $\ell_1$ penalty, a randomization term and a small
quadratic term,

$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p}\; \ell(\beta) + \omega^T\beta + \|\Lambda\beta\|_1 + (\text{small quadratic term in } \beta), \tag{38}$$

where $\omega \sim \mathrm{Logistic}(\kappa)$ is the perturbation to the gradient and $\Lambda$ is a diagonal matrix
which introduces (possibly) unequal feature weights; $\kappa$ controls the amount of
randomization added. The addition of the quadratic term ensures that (38) is strictly convex,
and thus has a unique solution. A similar formulation for linear regression has been
proposed in Meinshausen & Bühlmann (2010).

Selective inference in this setting has not been considered before. Without the Gaussian
assumptions, Lee et al. (2013a) does not apply. The parametric setting of this problem
has been discussed in Fithian et al. (2014), but computation of the selective tests
is mostly infeasible for general $X$. Finally, the asymptotic result by Tian & Taylor
(2015) does not apply here as that framework requires exactly affine selection regions,
which is not the case in this setting.

Suppose the solution to (38) has nonzero entry set $E$. Then our target of inference is
$\beta_E^*$, the unique population minimizer, which satisfies

$$E_F\left[x_{i,E}\left(y_i - \pi(x_{i,E}^T\beta_E^*)\right)\right] = 0. \tag{39}$$

Note that a parametric model $y_i \mid x_i \sim \mathrm{Bernoulli}(\pi(x_{i,E}^T\beta_E^*))$ with independently
sampled $x_i$'s will have $\beta_E^*$ satisfying (39). But we by no means assume such an underlying
distribution. Rather, for any well-behaved distribution $F$, $\beta_E^*$ can be thought of as a
statistical functional of the underlying distribution $F$, depending on the outcome of
selection $E$.

Selective inference in this setting is carried out conditioned on $(E, s_E)$, the active
set and its signs. We first introduce the following notation,

$$\pi_E(\beta_E) = \frac{\exp(X_E\beta_E)}{1 + \exp(X_E\beta_E)}, \qquad W_E(\beta_E) = \mathrm{diag}\left(\pi_E(\beta_E)(1 - \pi_E(\beta_E))\right),$$

$$Q_E(\beta_E) = \frac{1}{n} X_E^T W_E(\beta_E) X_E, \qquad C_E(\beta_E) = \frac{1}{n} X_{-E}^T W_E(\beta_E) X_E,$$

$$D_E(\beta_E) = C_E(\beta_E)\, Q_E^{-1}(\beta_E),$$

where $X$ is the feature matrix, and $X_E$, $X_{-E}$ are the columns corresponding to the
active set and inactive set respectively. By the law of large numbers, we have

$$Q_E(\beta_E^*) \overset{p}{\to} E_F\, Q_E(\beta_E^*) =: Q, \qquad C_E(\beta_E^*) \overset{p}{\to} E_F\, C_E(\beta_E^*) =: C, \qquad D_E(\beta_E^*) \overset{p}{\to} C Q^{-1} =: D. \tag{40}$$

Now we introduce our linearizable statistics and show that the conditioning event
$(E, s_E)$ can be expressed as affine regions of these statistics.

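Lemma 14. Suppose $E$ is the active set of the solution of (38), and we denote

$$\bar{\beta}_E = \arg\min_{\beta_E \in \mathbb{R}^{|E|}} -\sum_{i=1}^{n} \left[y_i \log \pi(x_{i,E}^T\beta_E) + (1 - y_i)\log(1 - \pi(x_{i,E}^T\beta_E))\right]$$

as the unpenalized MLE restricted to the selected variables $E$.

The following statistic $T$ is linearizable with asymptotic mean $(\beta_E^*, \rho)$ and variance
$\Sigma/n$,

$$T = \begin{pmatrix} \bar{\beta}_E \\ \frac{1}{n} X_{-E}^T\left[y - \pi_E(\bar{\beta}_E)\right] \end{pmatrix},$$

where the residual in its linearization, $R = o_p(n^{-1/2})$, is small, and
$\rho = E\left[x_{i,-E}\left(y_i - \pi(x_{i,E}^T\beta_E^*)\right)\right]$. Moreover, the selection event
$\{(\widehat{E}, z_{\widehat{E}}) = (E, s_E)\}$ can be characterized as the affine region
$\{\sqrt{n}\,A_M T + B_M\omega \le b_M\}$, where

$$A_M = \begin{pmatrix} -S_E & 0 \\ 0 & I_{-E} \\ 0 & -I_{-E} \end{pmatrix}, \qquad B_M = \begin{pmatrix} S_E Q^{-1} & 0 \\ D & -I_{-E} \\ -D & I_{-E} \end{pmatrix}, \qquad b_M = \begin{pmatrix} -S_E Q^{-1}\Lambda_E s_E \\ \lambda_{-E} - D\Lambda_E s_E \\ \lambda_{-E} + D\Lambda_E s_E \end{pmatrix},$$

where $I_{-E}$ denotes the identity matrix of $p - |E|$ dimensions, $\Lambda_E$, $\Lambda_{-E}$ denote the
active block and the inactive block of $\Lambda$ respectively, $\lambda$ is the vector of diagonal
elements of $\Lambda$, and $S_E = \mathrm{diag}(s_E)$.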

The proof of this lemma is also deferred to the appendix. 

Thus, using Lemma 12 and Lemma 13, we can conclude that under local alternatives, the
pivot (22) converges to $\mathrm{Unif}(0,1)$. To test $H_{0j}: \beta_j^* = 0$, we take $\eta = e_j$, and sample

$$\mathcal{L}\left(\eta^T T \,\middle|\, V_\eta,\; \sqrt{n}\,A_M T + B_M\omega \le b_M\right), \qquad (T, \omega) \sim N\left(\begin{pmatrix}\beta_E^* \\ \rho\end{pmatrix}, \frac{\Sigma}{n}\right) \times \mathbb{Q},$$

where $\rho = E\left[x_{i,-E}\left(y_i - \pi(x_{i,E}^T\beta_E^*)\right)\right]$. Since $\rho$ is a nuisance parameter for testing
$H_{0j}$, $j \in E$, the conditional law above will not depend on its value. A hit-and-run
algorithm for sampling this law can be implemented. Moreover, recent developments
by Tian et al. (2016) and Harris et al. (2016) propose more general and efficient sampling
schemes for this law. For details, see for example Chapter 3.2 of Tian et al. (2016), where
the sampling scheme for this very example is considered and simulation results are
provided.

In Lemma 14, we assume the covariance matrix $\Sigma$ is known. In applications, we can
bootstrap it. But is it valid to plug in the bootstrap estimate of $\Sigma$?

5.5. Plugging in variance estimates

In Section 5.3 we derived quantities that were asymptotically pivotal for the best
median, up to an unknown variance. In the sample median case, by (35), the variance of
the sample median is approximately $[4n f(m)^2]^{-1}$, where $f(m)$ is the PDF evaluated
at the median $m$. A simple consistent estimator for $f(m)$ is to take the $1/2 \pm \delta_n$ sample
quantiles $a_n$ and $b_n$ for a small bandwidth $\delta_n$; then

$$\hat{f}(m) = \frac{2\delta_n}{b_n - a_n} \tag{41}$$

is consistent for $f(m)$, based on which we get a consistent estimator for $\sigma_1^2$.
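
The quantile-spacing idea can be sketched in a few lines. The code below is an illustration, not the estimator used for Figure 4: the bandwidth `delta` is a free parameter chosen by the user here, and the function name `median_variance_estimate` is hypothetical.

```python
import numpy as np


def median_variance_estimate(x, delta):
    """Estimate f(m) and the variance of the sample median from quantile spacing.

    a and b are the (1/2 - delta) and (1/2 + delta) sample quantiles. A fraction
    2 * delta of the data lies between them, so 2 * delta / (b - a) approximates
    the density at the median. The variance of the sample median is then
    approximated by 1 / (4 n f(m)^2).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    a, b = np.quantile(x, [0.5 - delta, 0.5 + delta])
    f_hat = 2.0 * delta / (b - a)
    var_hat = 1.0 / (4.0 * n * f_hat**2)
    return float(f_hat), float(var_hat)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.standard_normal(200_000)
    # For N(0, 1) the density at the median is 1 / sqrt(2 * pi), about 0.399.
    print(median_variance_estimate(sample, delta=0.02))
```

A smaller `delta` reduces the smoothing bias but makes the spacing $b_n - a_n$ noisier, so the bandwidth has to shrink slowly with the sample size for consistency.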

More generally, computing the pivot (22) requires knowledge of $\Sigma$. In practice, we
usually do not have prior knowledge of the variance $\Sigma$ and need a consistent estimate
for $\Sigma$. We might use a bootstrap or jackknife estimator. When $p$ is fixed, the bootstrap
estimator is consistent and thus we get a consistent estimator $\hat{\Sigma}$. Lemma 3 states that
under moment conditions on the likelihood, $\hat{\Sigma}$ will be consistent for $\Sigma$ under $F^*$ as
well, justifying the plug-in estimator of $\Sigma$.

Figure 4 shows simulation results for the two-sample medians problem. In each
case, we take the sample size for each treatment group to be 500, and generate the
noise from a skewed distribution $N(0, 1) + 0.5\,\mathrm{Exp}(1)$. We standardize it such that
the noise has median 0 under the null hypothesis. We use additive logistic noise with
scale $\kappa = 0.8$ for randomization. The better group is decided using the randomized
sample median, and selective inference is carried out. In Figure 4a, the pivot with the
plug-in variance estimate $\hat{\sigma}$ from (41) is plotted under both the null hypothesis
$H_0: \mu_{better} = 0$ and an alternative $H_a$ with $\mu_{better} > 0$. The pivot has reasonable power
even for identifying local alternatives. The pivot is almost exactly $\mathrm{Unif}(0,1)$ under the
null hypothesis with the sample size $n = 500$. In fact, it is very close at a relatively
small sample size $n = 50$, justifying the application of asymptotics in the nonparametric
setting. Figure 4b further illustrates the difference between the unselective and selective
distributions and the convergence to the theoretical limit. We see that there is a clear
shift in the selective distribution that calls for adjustment for the selection. For sample
size $n = 500$, the empirical selective distribution converges to our theoretical distribution.


6. Multiple Randomizations of the Data 

Most of the examples above focus on a single randomization $\omega$ on the data, which we
use for model selection. We naturally want to extend it to multiple randomizations, and
multiple randomized selections, which will collectively suggest a model for inference.
In this section, we allow multiple randomizations in a possibly sequential fashion and
discuss how inference can be carried out.

Fig 4: Asymptotic distribution of the median for the selected group. (a) Null and alternative pivot for the “better” median; (b) Selective vs. unselective distribution, and theoretical PDF.


6.1. Selective inference after cross-validation 

Consider the case where we first choose a regularization parameter by cross-validation,
and then fit the square-root LASSO problem (Belloni et al. 2011) at this parameter,

$$\hat{\beta}_\lambda(y; X) = \arg\min_{\beta} \|y - X\beta\|_2 + \lambda\|\beta\|_1, \tag{42}$$

where $\lambda$ is picked from a fixed grid $\Lambda = [\lambda_1, \dots, \lambda_k]$. The discussion below is not
specific to selection by square-root LASSO.

The model selected by cross-validated square-root LASSO involves two steps of
selection. We denote by $y_{CV}$ the response used for selecting the regularization parameter,
and by $y_{select}$ the response vector used for fitting the square-root LASSO at the selected
regularization parameter $\lambda$. Both vectors are randomized versions of the original vector $y$.
Inference after cross-validation requires combining two steps of randomized selection.
Consider the following procedure.

First, we randomize $y$ to get the vectors $y_{CV}$ and $y_{select}$,

$$\begin{aligned} y_{inter} \mid y &\sim N(y, \sigma_1^2 I), \\ y_{CV} \mid y_{inter}, y &\sim N(y_{inter}, \sigma_{2,CV}^2 I), \\ y_{select} \mid y_{inter}, y &\sim N(y_{inter}, \sigma_{2,select}^2 I). \end{aligned} \tag{43}$$

Note that the intermediate vector $y_{inter}$ is introduced for convenience of sampling. The
above is just one of the plausible randomization schemes.
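
A minimal sketch of drawing the three randomized vectors, assuming independent Gaussian noise as in (43), is given below. The function name `randomize_response` and the argument names are hypothetical, and the noise levels in the usage example are arbitrary.

```python
import numpy as np


def randomize_response(y, sigma1, sigma2_cv, sigma2_select, rng):
    """Draw (y_inter, y_cv, y_select) from the two-level Gaussian scheme.

    y_inter adds N(0, sigma1^2 I) noise to y. y_cv and y_select each add their
    own independent Gaussian noise to y_inter, so they are conditionally
    independent given y_inter.
    """
    y = np.asarray(y, dtype=float)
    y_inter = y + sigma1 * rng.standard_normal(y.shape)
    y_cv = y_inter + sigma2_cv * rng.standard_normal(y.shape)
    y_select = y_inter + sigma2_select * rng.standard_normal(y.shape)
    return y_inter, y_cv, y_select


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.standard_normal(100)
    y_inter, y_cv, y_select = randomize_response(y, 0.5, 0.5, 0.5, rng)
    print(np.var(y_cv - y), np.var(y_select - y))
```

Marginally, $y_{CV} - y$ has variance $\sigma_1^2 + \sigma_{2,CV}^2$ per coordinate, and the shared component $y_{inter} - y$ makes $y_{CV}$ and $y_{select}$ positively correlated given $y$.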

After having randomized, we select $\lambda$ with $K$-fold cross-validation using $y_{CV}$:

$$\hat{\lambda} = \hat{\lambda}(y_{CV}, X) = \arg\min_{\lambda \in \Lambda} CV_K(y_{CV}, X, \lambda), \tag{44}$$

where $CV_K(y, X, \lambda)$ is the usual $K$-fold cross-validation score with coefficients
estimated by the square-root LASSO. Alternatively, one could compute the cross-validation
score using the OLS estimators of the selected variables. Note that we have left implicit
the randomization that splits observations into groups. That is, $\hat{\lambda}$ in (44) above is a
function of $(y_{CV}, X, \omega)$, where $\omega$ is a random partition of $\{1, \dots, n\}$ into $K$ groups. When
we sample $y_{CV}$ below, we redraw $\omega$ each time.

The subset of variables and signs is selected using the square-root LASSO with
response $y_{select}$,

$$\widehat{E}(y_{CV}, y_{select}, X) = \left\{j : \hat{\beta}_{\hat{\lambda}(y_{CV}, X),\, j}(y_{select}; X) \ne 0\right\},$$

$$z_{\widehat{E}}(y_{CV}, y_{select}, X) = \mathrm{sign}\left(\hat{\beta}_{\hat{\lambda}(y_{CV}, X),\, \widehat{E}}(y_{select}; X)\right).$$

After seeing the selected variables $E$, we perform inference in the selected model
$M_{sel}(E)$. Since $M_{sel}(E)$ is an exponential family, we will still have an exponential
family after selection. Per Lemma 2, we sample from the following law,

$$\mathcal{L}\left(X_j^T y \,\middle|\, \hat{\lambda}(y_{CV}, X) = \lambda,\; (\widehat{E}, z_{\widehat{E}}) = (E, z_E),\; P_{E\setminus j}\, y\right).$$

The additional conditioning on the signs is for computational reasons. In fact, recent
developments in Harris et al. (2016) propose sampling schemes that overcome these
difficulties, so that we do not need to condition on this additional information.

To sample from the above law, we use a Gibbs-type sampler, which iterates over $y$,
$y_{inter}$, $y_{CV}$ and $y_{select}$, sampling each conditional on the other three and the selection
event. It includes the following steps.


Sampling $y_{CV}$. Using the conditional independence of $y_{CV}$ and $y_{select}$ given $y_{inter}$, we
have

$$\mathcal{L}\big( y_{CV} \mid y_{inter}, y_{select}, y, \hat\lambda(y_{CV}) = \lambda \big) = N(y_{inter}, \sigma^2_{2,CV} I) \cdot 1\{ \hat\lambda(y_{CV}) = \lambda \}.$$

This is the computational bottleneck, as we do not have a good description of
the selection event for cross-validation. A brute-force sampling scheme will be
computationally expensive, as we need to refit the model over a grid of $\lambda$'s. Thus,
we do not update $y_{CV}$ too often.

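Sampling $y_{select}$. The conditional independence of $y_{select}$ and $y_{CV}$ given $y_{inter}$ implies

$$\mathcal{L}\big( y_{select} \mid y_{inter}, y_{CV}, (\hat E, z_{\hat E})(y_{CV}, y_{select}) = (E, z_E), \hat\lambda(y_{CV}) = \lambda \big)$$

$$= N(y_{inter}, \sigma^2_{2,select} I) \cdot 1\{ (\hat E_\lambda, z_{\hat E, \lambda})(y_{select}) = (E, z_E) \}.$$

Tian et al. (2015) have given an explicit description of the selection event

$$\{ (\hat E_\lambda, z_{\hat E, \lambda}) = (E, z_E) \}.$$

Thus hit-and-run sampling provides a tractable sampling scheme.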
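
To illustrate what a hit-and-run update looks like, the following sketch performs one move for an isotropic Gaussian restricted to a polytope. It assumes, purely for illustration, that the selection event has been written as $\{y : Ay \le b\}$; the description in Tian et al. (2015) for the square-root LASSO may take a different form, and the function name and arguments here are hypothetical.

```python
import math
import random
from statistics import NormalDist

def hit_and_run_step(y, mean, sd, A, b, rng=random):
    """One hit-and-run move for N(mean, sd^2 I) restricted to {y : A y <= b}.

    The current point y is assumed to be feasible.
    """
    # 1. Draw a uniformly random direction d on the unit sphere.
    d = [rng.gauss(0.0, 1.0) for _ in y]
    norm = math.sqrt(sum(v * v for v in d))
    d = [v / norm for v in d]

    # 2. Find the interval [lo, hi] of step sizes t keeping y + t*d feasible.
    lo, hi = -math.inf, math.inf
    for row, b_i in zip(A, b):
        a_d = sum(r * v for r, v in zip(row, d))
        slack = b_i - sum(r * v for r, v in zip(row, y))
        if a_d > 0:
            hi = min(hi, slack / a_d)
        elif a_d < 0:
            lo = max(lo, slack / a_d)

    # 3. Along the line, t is N(m_t, sd^2) truncated to [lo, hi].
    m_t = sum((m - v) * dv for m, v, dv in zip(mean, y, d))
    std = NormalDist()
    u_lo = std.cdf((lo - m_t) / sd) if lo > -math.inf else 0.0
    u_hi = std.cdf((hi - m_t) / sd) if hi < math.inf else 1.0
    u = min(max(rng.uniform(u_lo, u_hi), 1e-12), 1.0 - 1e-12)
    t = min(max(m_t + sd * std.inv_cdf(u), lo), hi)  # clamp guards rounding
    return [v + t * dv for v, dv in zip(y, d)]
```

In the Gibbs sampler above, `mean` would play the role of $y_{inter}$ and `sd` that of $\sigma_{2,select}$; each call only requires the chord of the region along one direction, so no global description of the truncated law is needed.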

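Sampling $y_{inter}$. This is a simple step. Because the selection event is based on $y_{CV}$ and
$y_{select}$, we have

$$\mathcal{L}( y_{inter} \mid y, y_{select}, y_{CV} ) = N\left( \frac{ y/\sigma_1^2 + y_{CV}/\sigma^2_{2,CV} + y_{select}/\sigma^2_{2,select} }{ 1/\sigma_1^2 + 1/\sigma^2_{2,CV} + 1/\sigma^2_{2,select} },\ \frac{1}{ 1/\sigma_1^2 + 1/\sigma^2_{2,CV} + 1/\sigma^2_{2,select} }\, I \right),$$

where $\sigma_1^2$ denotes the variance of $y_{inter}$ given $y$.

Sampling $y$. This is also simple with our randomization scheme. Note that $y$ is conditionally
independent of $y_{select}$ and $y_{CV}$ given $y_{inter}$,

$$\mathcal{L}( y \mid y_{inter}, y_{CV}, y_{select} ) = \mathcal{L}( y \mid y_{inter} ).$$

Since we condition on $P_{E \setminus j} y$, we essentially take $y$ and project out the update
on the space orthogonal to that of $X_j$.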

A chain that iterates through the above four steps will give us samples from the 
desired distribution for inference. 
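
Of the four steps, the update of $y_{inter}$ is the one with a fully explicit form, since it is a standard Gaussian conjugate computation. The sketch below is a minimal illustration, assuming $y_{inter} \mid y \sim N(y, \sigma_1^2 I)$ and that $y_{CV}$ and $y_{select}$ are $y_{inter}$ plus independent isotropic Gaussian noise; the function names and the argument names `s1`, `s_cv`, `s_sel` (standard deviations) are hypothetical.

```python
import math
import random

def y_inter_conditional(y, y_cv, y_select, s1, s_cv, s_sel):
    """Mean vector and common standard deviation of y_inter | y, y_CV, y_select."""
    w = [1.0 / s1 ** 2, 1.0 / s_cv ** 2, 1.0 / s_sel ** 2]  # precisions
    total = sum(w)
    mean = [(w[0] * a + w[1] * b + w[2] * c) / total
            for a, b, c in zip(y, y_cv, y_select)]
    return mean, math.sqrt(1.0 / total)

def sample_y_inter(y, y_cv, y_select, s1, s_cv, s_sel, rng=random):
    """One Gibbs update: draw y_inter from its Gaussian full conditional."""
    mean, sd = y_inter_conditional(y, y_cv, y_select, s1, s_cv, s_sel)
    return [rng.gauss(m, sd) for m in mean]
```

The conditional mean is the precision-weighted average of the three vectors, so a noisier copy (larger standard deviation) pulls $y_{inter}$ less strongly.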


6.2. Collaborative selective inference 

One of the motivations of the reusable holdout described in Dwork et al. (2014) is that 
it allows a data analyst to repeatedly query a database yet still be able to approximately 
estimate expectations even after asking many questions about the data. Another version 
of this model may be that several groups wish to model the same data and then, as a 
consortium, decide on a final model and be able to approximately estimate expectations
in this final model. We might call this collaborative selective inference.

Formally, suppose each of $L$ groups has its own preferred method of model selection,
encoded as selection procedures $(\hat Q_l)_{1 \le l \le L}$. We assume there is a central “data”
bank that decides what “data” each group is allowed to see. We express this as
a sequence of randomization schemes $(y_l^*)_{1 \le l \le L}$. Formally, this is equivalent to enlarging
the probability space to $\mathcal{D} \times \mathcal{B}$ with measure $\mathbb{F} \times \mathbb{B}$ and fixing a function
$y^*(y, \omega) = (y_1^*(y, \omega), \ldots, y_L^*(y, \omega))$. It may be desirable to choose the law of $y^* \mid y$ so
that the coordinates are conditionally independent given $y$, though it is not necessary.

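Now suppose that the $L$ groups choose models $\hat M_l^* = \hat Q_l(y_l^*) \in \sigma(y_l^*)$ and convene
to discuss what the best model $M$ is. For every choice of $L$ models $(M_1, \ldots, M_L)$
and final model $M$, the following selective distribution can be used for valid selective
inference

$$\frac{d\mathbb{F}^*}{d\mathbb{F}}(y) = \frac{ \mathbb{B}\big( \omega : \cap_{l=1}^L \{ \hat Q_l(y_l^*(y, \omega)) = M_l \} \big) }{ (\mathbb{F} \times \mathbb{B})\big( \cap_{l=1}^L \{ \hat Q_l^* = M_l \} \big) }.$$

When the $y_l^*$ are conditionally independent given $y$ then it is clear that

$$\mathbb{B}\big( \cap_{l=1}^L \{ \hat Q_l(y_l^*(y, \omega)) = M_l \} \big) = \prod_{l=1}^L \mathbb{B}\big( \hat Q_l(y_l^*(y, \omega)) = M_l \big).$$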

It is possible that the consortium has beforehand decided on an algorithm that will
choose a best model automatically, determined by some function $S(M_1, \ldots, M_L)$. In
this case, one should use the selective distribution

$$\frac{d\mathbb{F}^*}{d\mathbb{F}}(y) = \frac{ \mathbb{B}\big( \omega : S(\hat M_1^*(y, \omega), \ldots, \hat M_L^*(y, \omega)) = M \big) }{ (\mathbb{F} \times \mathbb{B})\big( S(\hat M_1^*, \ldots, \hat M_L^*) = M \big) }. \qquad (47)$$

When the models in question are parametric, perhaps Gaussian distributions, and the
randomization is additive Gaussian noise, the central data bank can explicitly lower
bound the leftover information by

$$\mathrm{Var}(y \mid y_1^*, \ldots, y_L^*).$$

This quantity is expressible in terms of the marginal variance of $y$ and the central
data bank's noise generating distribution for $y^*(y, \omega) = (y + \omega_1, \ldots, y + \omega_L)$. By
maintaining a lower bound on the above quantity, the central data bank can maintain
a minimum prescribed information in the data for final estimation and/or inference. In
a sequential setting, where valid inference is desired at each step, maintaining a lower
bound may involve releasing noisier and noisier versions of $y$. Sampling under this
scheme seems quite difficult, and we leave it as an area of interesting future research.
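
As a small illustration of how a data bank could track this quantity, consider the scalar case with $y \sim N(\mu, \sigma^2)$ and releases $y + \omega_l$, where the $\omega_l$ are assumed independent of each other and of $y$ with known variances. Under these assumptions (which go beyond what is stated above) the conditional variance is the inverse of the summed precisions. The function names are hypothetical.

```python
def leftover_variance(var_y, noise_vars):
    """Var(y | y + w_1, ..., y + w_L) for scalar Gaussian y and independent
    Gaussian noises w_l with variances noise_vars (an assumed setting)."""
    precision = 1.0 / var_y + sum(1.0 / v for v in noise_vars)
    return 1.0 / precision

def min_noise_variance(var_y, noise_vars, floor):
    """Smallest noise variance for the next release that keeps the leftover
    variance at or above `floor`; infinite if no release is affordable."""
    current = 1.0 / var_y + sum(1.0 / v for v in noise_vars)
    budget = 1.0 / floor - current  # precision that may still be released
    if budget <= 0:
        return float("inf")
    return 1.0 / budget
```

Each release spends part of a fixed precision budget, which is one way to see why later releases have to be noisier if the lower bound is to be maintained.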

7. Proof 

7.1. Proof of Theorem 9 

To prove Theorem 9, we first prove the following lemma, which might be of independent
interest.

Lemma 15. Suppose $T_n$ is a linearizable statistic for $\mu_n = \mu(\mathbb{F}_n)$ as defined in (21).
Let $Z_n = \sqrt{n}(T_n - \mu_n) \in \mathbb{R}^p$ and let $f : \mathbb{R}^p \to \mathbb{R}$ be a function with finite $\lambda_3(f)$ for
some norm $\Omega$ on $\mathbb{R}^p$. Moreover, suppose $\mathbb{F}_n$ has finite centered exponential moments in a
neighbourhood of zero. Then

$$\left| E_{\mathbb{F}_n}[f(Z_n)] - E_{\Phi_n}[f(Z_n)] \right| \le C(p) \lambda_3(f) n^{-1/2}, \quad n \ge n_0,$$

for some $n_0 \ge 1$, where $C(p)$ is a constant only dependent on the dimension.

Lemma 15 can be seen as an extension of the result by Chatterjee (2005) in the sense
that Chatterjee (2005) established the result for the case $\Omega = 0$. The proof is
also an adaptation of the technique in Chatterjee (2005).

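Proof. Without loss of generality, we assume $T_n = \frac{1}{n}\sum_{i=1}^n \xi_{i,n}$, i.e. the residual is 0.

First, we define the normalizing operator. For any $S \in \mathbb{R}^{n \times p}$,

$$\mathcal{N}(S) = \frac{1}{\sqrt{n}} \sum_{i=1}^n S[i],$$

where $S[i]$ is the $i$-th row of $S$.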

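We also define for any $n$ and $0 \le k \le n$,

$$S_{n,k} = \begin{pmatrix} \zeta_{1,n} \\ \vdots \\ \zeta_{k,n} \\ \xi_{k+1,n} - \mu(\mathbb{F}_n) \\ \vdots \\ \xi_{n,n} - \mu(\mathbb{F}_n) \end{pmatrix}, \qquad S^-_{n,k} = \begin{pmatrix} \zeta_{1,n} \\ \vdots \\ \zeta_{k-1,n} \\ 0 \\ \xi_{k+1,n} - \mu(\mathbb{F}_n) \\ \vdots \\ \xi_{n,n} - \mu(\mathbb{F}_n) \end{pmatrix},$$

where $\xi_{i,n} \sim \mathbb{F}_n$ with mean $\mu(\mathbb{F}_n)$ and variance $\Sigma(\mathbb{F}_n)$ and $\zeta_{i,n} \sim N(0, \Sigma(\mathbb{F}_n))$.

Let $\mathbb{F}_{n,k}$, $\mathbb{F}^-_{n,k}$ denote the distributions of $S_{n,k}$ and $S^-_{n,k}$ respectively. Note that $\mathbb{F}_{n,k}$ and
$\mathbb{F}^-_{n,k}$ are determined by $\mathbb{F}_n$. For simplicity of notation we only distinguish the two
distributions by $\mathbb{F}_{n,k}$ and $\mathbb{F}^-_{n,k}$, avoiding verbose notation for $S$, e.g. $E_{\mathbb{F}_{n,k}}[S] = E[S_{n,k}]$.
It is then easy to see $Z_n = \mathcal{N}(S_{n,0})$.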
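Now by telescoping:

$$\left| E_{\mathbb{F}_n}[f] - E_{\Phi_n}[f] \right| = \left| E_{\mathbb{F}_{n,0}}[f \circ \mathcal{N}(S)] - E_{\mathbb{F}_{n,n}}[f \circ \mathcal{N}(S)] \right|$$

$$\le \sum_{i=1}^n \left| E_{\mathbb{F}_{n,i-1}}[f \circ \mathcal{N}(S)] - E_{\mathbb{F}^-_{n,i}}[f \circ \mathcal{N}(S)] + E_{\mathbb{F}^-_{n,i}}[f \circ \mathcal{N}(S)] - E_{\mathbb{F}_{n,i}}[f \circ \mathcal{N}(S)] \right|.$$

Let $\partial_i$ be the derivative with respect to the $i$-th row $S[i]$. Using Taylor's expansion
at $S^-_{n,i}$, we have

$$E_{\mathbb{F}_{n,i-1}}[f \circ \mathcal{N}] - E_{\mathbb{F}^-_{n,i}}[f \circ \mathcal{N}] = E_{\mathbb{F}^-_{n,i}}[\partial_i f \circ \mathcal{N}(S)]^T 0 + \frac{1}{2} \mathrm{Tr}\left[ E_{\mathbb{F}^-_{n,i}}\left( \partial_i^2 f \circ \mathcal{N}(S) \right) \cdot \Sigma(\mathbb{F}_n) \right] + R_{n,i},$$

where the precise form of the Taylor remainder $R_{n,i}$ depends on realizing the laws $\mathbb{F}_{n,i-1}$
and $\mathbb{F}^-_{n,i}$ on the same probability space. In order not to introduce new notation, we have
avoided explicitly writing out this construction, directing readers to Chatterjee (2005)
for details. Nevertheless,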


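$$|R_{n,i}| \le c_1(p) \left[ \lambda_3(f) \cdot n^{-3/2} \right] E_{\mathbb{F}^-_{n,i}}\left[ \exp\left( \Omega(\mathcal{N}(S)) + \Omega(\xi^0_{i,n}/\sqrt{n}) \right) \|\xi^0_{i,n}\|_1^3 \right],$$

where $\xi^0_{i,n}$ are centered versions of $\xi_{i,n}$ and $c_1$ is some dimension dependent constant.

Let $C(\Omega)$ be the constant such that $\Omega(\cdot) \le C(\Omega)\|\cdot\|_1$; $C(\Omega)$ only depends on the dimension
$p$. Thus, using the independence of the $\xi_{i,n}$,

$$E_{\mathbb{F}^-_{n,i}}\left[ \exp\left( \Omega(\mathcal{N}(S)) + \Omega(\xi^0_{i,n}/\sqrt{n}) \right) \|\xi^0_{i,n}\|_1^3 \right] \le E_{\mathbb{F}^-_{n,i}}\left[ \exp\left( C(\Omega)\|\mathcal{N}(S)\|_1 \right) \right] \cdot E_{\mathbb{F}_n}\left[ \|\xi^0_{i,n}\|_1^3 \exp\left( C(\Omega)\|\xi^0_{i,n}\|_1/\sqrt{n} \right) \right].$$
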


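Now we bound these two expectations. By the exponential moment condition (29)
and Lemma 17, it is easy to conclude that the first term is bounded:

$$\limsup_n E_{\mathbb{F}_n}\left[ \exp\left( C(\Omega)\|Z_n\|_1 \right) \right] \le c_2(p).$$

The second expectation is bounded by $\gamma$, an upper bound on the third moment of $\xi^0_{i,n}$:

$$\limsup_n E_{\mathbb{F}_n}\left[ \|\xi^0_{i,n}\|_1^3 \exp\left( C(\Omega)\|\xi^0_{i,n}\|_1/\sqrt{n} \right) \right] \le \gamma.$$

Thus it is not hard to see

$$|R_{n,i}| \le c_1(p) c_2(p) \gamma \lambda_3(f) n^{-3/2}.$$

Notice that the first and second order terms in $E_{\mathbb{F}_{n,i-1}}[f \circ \mathcal{N}] - E_{\mathbb{F}^-_{n,i}}[f \circ \mathcal{N}]$ cancel with
those in $E_{\mathbb{F}_{n,i}}[f \circ \mathcal{N}] - E_{\mathbb{F}^-_{n,i}}[f \circ \mathcal{N}]$, and therefore we have

$$\left| E_{\mathbb{F}_n}[f] - E_{\Phi_n}[f] \right| \le \sum_{i=1}^n \left( |R_{n,i}| + |\tilde R_{n,i}| \right),$$

where $\tilde R_{n,i}$ is the remainder of $E_{\mathbb{F}_{n,i}}[f \circ \mathcal{N}] - E_{\mathbb{F}^-_{n,i}}[f \circ \mathcal{N}]$. With a similar argument

$$|\tilde R_{n,i}| \le c_1(p) c_2(p) \gamma \lambda_3(f) n^{-3/2},$$

and summing over $n$ terms, we have the conclusion of the lemma. □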

Now we prove the main theorem, Theorem 9.

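Proof. First, notice that per Lemma 7, we have $E_{\Phi_n^*}[g \circ P(T_n)] = \int_0^1 g(x)\,dx$. Using
the selective likelihood ratio, it is easy to see,

$$E_{\mathbb{F}_n^*}[g(P(T_n))] = E_{\mathbb{F}_n}[g(P(T_n)) \ell_{\mathbb{F}_n}(T_n)].$$

The same equation holds for $\Phi_n = \Phi(\mathbb{F}_n)$, thus we have

$$\left| E_{\mathbb{F}_n^*}[g(P(T_n))] - E_{\Phi_n^*}[g(P(T_n))] \right| \le \left| E_{\mathbb{F}_n}[g(P(T_n)) \ell_{\Phi_n}(T_n)] - E_{\Phi_n}[g(P(T_n)) \ell_{\Phi_n}(T_n)] \right|$$

$$+ \left| E_{\mathbb{F}_n}[g(P(T_n)) \ell_{\mathbb{F}_n}(T_n)] - E_{\mathbb{F}_n}[g(P(T_n)) \ell_{\Phi_n}(T_n)] \right|. \qquad (48)$$

We need to bound both terms. Recall the notation $P_n$ and $\ell_{\Phi_n}$ for the normalized
statistic $Z_n$. If we let $f = g(P_n) \cdot \ell_{\Phi_n}$, then per Lemma 15, we have

$$\left| E_{\mathbb{F}_n}[g(P_n) \cdot \ell_{\Phi_n}] - E_{\Phi_n}[g(P_n) \cdot \ell_{\Phi_n}] \right| \le 2C(p) \cdot C_1 n^{-1/2},$$

where we use the bound in condition (28). Now we replace $\ell_{\mathbb{F}_n}$ with $\ell_{\Phi_n}$ in the second
term. With some algebra, we can bound it by

$$E_{\mathbb{F}_n}[g(P(T_n)) \ell_{\mathbb{F}_n}(T_n)] \cdot \left| 1 - \frac{ P_{\mathbb{F}_n \times \mathbb{Q}}[M \in \hat Q^*(T_n, \omega)] }{ P_{\Phi_n \times \mathbb{Q}}[M \in \hat Q^*(T_n, \omega)] } \right|,$$

which in turn is bounded by

$$C(g) \cdot \frac{ \left| P_{\mathbb{F}_n \times \mathbb{Q}}[M \in \hat Q^*] - P_{\Phi_n \times \mathbb{Q}}[M \in \hat Q^*] \right| }{ P_{\Phi_n \times \mathbb{Q}}[M \in \hat Q^*] } \le C(g) C_3 n^{-1/2},$$

per condition (30), where $C(g)$ is a bound on $g$.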


□ 










7.2. Proof of Lemma 7 


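Proof. Let $\phi_{(\mu,\Sigma)}$ denote the density for $N(\mu, \Sigma/n)$ and write $T = \frac{1}{\sigma^2}\Sigma\eta \cdot (\eta^T T) + V_\eta$; we see
that $Q(\eta^T T; V_\eta)$ is $W(M; T)$ in (13). Thus under the selective law $\mathbb{F}^*$, the distribution
of $T$ has density proportional to

$$\phi_{(\mu,\Sigma)}(t) \cdot Q(\eta^T t; V_\eta).$$

Since $\eta^T T \perp V_\eta$ under $\mathbb{F}$, we can factorize $\phi_{(\mu,\Sigma)}(t)$ into the product of densities of
$\eta^T T$ and $V_\eta$. Thus conditioning on $V_\eta$, the density of $\eta^T T$ is proportional to

$$\exp\left( -\frac{ n(\eta^T T - \eta^T \mu)^2 }{ 2\sigma^2 } \right) \cdot Q(\eta^T T; V_\eta).$$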

Therefore, the pivot in (22) is the survival function of $\eta^T T$ under $\mathbb{F}^*$ and is distributed
as Unif(0,1). Moreover, we note the distribution does not depend on the conditioned
value of $V_\eta$, thus $P(T; \Sigma)$ in (22) is Unif(0,1). □

Appendix A: Proof of Lemma 1

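Proof. First we normalize the sample mean as $Z = \sqrt{n}(\bar X_n + 0.5)$ and rewrite the
pivot as

$$P(Z) = \frac{1 - \Phi(Z)}{1 - \Phi\left( 2 + \frac{1}{2}\sqrt{n} \right)}, \quad Z > 2 + \frac{1}{2}\sqrt{n},$$

and $\Phi$ is the CDF of the standard normal distribution. As $n \to \infty$, we can use Mills
ratio to approximate the normal tail. Specifically, denote $b_n = \frac{1}{2}\sqrt{n} + 2$,

$$\frac{1 - \Phi(Z)}{1 - \Phi(b_n)} \approx \frac{b_n}{Z} \exp\left[ \frac{1}{2}(b_n + Z)(b_n - Z) \right] = \frac{b_n}{Z} \exp\left[ -\frac{1}{2}(b_n - Z)^2 \right] \exp\left[ -b_n(Z - b_n) \right], \quad Z > b_n. \qquad (49)$$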


We study the behavior of $-b_n(Z - b_n)$ for $Z > b_n$. By studying its distribution, we
will also see that $Z - b_n \overset{p}{\to} 0$, for $Z > b_n$, thus the term

$$R_n = \frac{b_n}{Z} \exp\left[ -\frac{1}{2}(b_n - Z)^2 \right] \overset{p}{\to} 1, \quad \text{as } n \to \infty.$$

Now we study the distribution of $b_n(Z - b_n)$ conditioning on $Z > b_n$. Since $\bar X_n$
is a translation of a binomial distribution divided by $n$, we can rewrite $Z$ in terms of a
Binomial distribution, which will be useful for calculating the conditional distribution
of $b_n(Z - b_n)$. Specifically,

$$Z = \frac{2S_n - n}{\sqrt{n}}, \quad S_n \sim \mathrm{Bin}\left( n, \frac{1}{2} \right).$$


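Thus for $t > 0$,

$$P(b_n(Z - b_n) > t) = P\left( Z > b_n + \frac{t}{b_n} \right) = P\left( S_n > \sqrt{n} + \frac{3}{4}n + \frac{\sqrt{n}\,t}{2b_n} \right).$$

To study the conditional distribution $P(b_n(Z - b_n) > t \mid Z > b_n)$, we essentially need
to study the ratio of two partial sums of binomial coefficients.

Note that for any $n, m \in \mathbb{N}$, $n > m$, we have

$$\frac{\binom{n}{m-1}}{\binom{n}{m}} = \frac{m}{n - m + 1}.$$

Noticing that $\binom{n}{k-1} / \binom{n}{k} \le \frac{m}{n - m + 1}$ for any $k \le m$, thus

$$\frac{\sum_{i=0}^{m-1} \binom{n}{i}}{\sum_{i=0}^{m} \binom{n}{i}} \le \frac{m}{n - m + 1}.$$

Now let $m = \frac{1}{4}n - \sqrt{n}$, and use the above inequality $j = \frac{\sqrt{n}\,t}{2b_n}$ times, we have

$$\frac{\sum_{i=0}^{m-j} \binom{n}{i}}{\sum_{i=0}^{m} \binom{n}{i}} \le \left( \frac{m}{n - m + 1} \right)^j.$$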

Therefore we have,

$$P(b_n(Z - b_n) > t \mid Z > b_n) = \frac{P(b_n(Z - b_n) > t)}{P(b_n(Z - b_n) > 0)} \le \left( \frac{\frac{1}{4}n - \sqrt{n}}{\frac{3}{4}n + \sqrt{n} + 1} \right)^{\sqrt{n}\,t/(2b_n)} \to \exp[-\log(3)\, t]. \qquad (50)$$

We can draw two conclusions from (50). First, conditional on $Z > b_n$, $Z - b_n \overset{p}{\to} 0$,
which implies that the first term in the pivot approximation (49) converges to 1. Moreover, (50)
shows that the overshoot $b_n(Z - b_n)$ is not Exp(1) distributed in the limit. In fact,
we can conclude that its limit (if it exists) is strictly stochastically dominated by an Exp(1).
Thus,

$$\exp[-b_n(Z - b_n)] \not\to \mathrm{Unif}(0,1),$$

and hence the pivot does not converge to Unif(0,1). □
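
The two binomial-coefficient facts used in the proof above, namely $\binom{n}{m-1}/\binom{n}{m} = m/(n-m+1)$ and the partial-sum bound $\sum_{i=0}^{m-1}\binom{n}{i} \big/ \sum_{i=0}^{m}\binom{n}{i} \le m/(n-m+1)$, can be checked exactly in integer arithmetic. The sketch below is only a sanity check on small cases, not part of the argument; the function names are hypothetical.

```python
from math import comb

def ratio_identity_holds(n, m):
    """Check C(n, m-1) / C(n, m) == m / (n - m + 1) without floating point."""
    return comb(n, m - 1) * (n - m + 1) == comb(n, m) * m

def partial_sum_bound_holds(n, m):
    """Check sum_{i<m} C(n, i) / sum_{i<=m} C(n, i) <= m / (n - m + 1)."""
    lower = sum(comb(n, i) for i in range(m))      # i = 0, ..., m-1
    upper = sum(comb(n, i) for i in range(m + 1))  # i = 0, ..., m
    return lower * (n - m + 1) <= upper * m
```

With $m \approx n/4$ the bound is close to $1/3$, which is where the factor $\log(3)$ in (50) comes from.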


Appendix B: Proof of Lemma 14 

Proof. We first prove that $T$ is in fact a linearizable statistic. Since $\bar\beta_E$ is the restricted
MLE, we see that


0= [y-TTEiPE)] 
n 

= ~^E[y ~ '^e{I3e)] + Q{Pe ~ Pe) + Ri, 

where Ri — (Qe(/3e) ~ Q){I^e ~ Pe) + Ri, where Ri = Op(n“^/^) is the residual 
from the Taylor’s expansion at /3|,. i?i = since deviations Qe{Pe) from its 

asymptotic mean should be Op{n~^^‘^) and ^e — P*e = Op (I). 

Thus, we can deduce 

Pe =—Q ^— EEi^E)] F [i*E + Q ^Ri- 

Similarly, 

-X'^E [y - ee{XePe)] = -X'^E[y-EE{l3*E)\--DX'^[y-EE{l3*E)\+Op{n~'^/Z 
n n n 

Thus we can conclude that $T$ is a linearizable statistic with

$$\xi_i = \begin{pmatrix} Q^{-1} x_{i,E}\left( y_i - \pi(x_{i,E}^T \beta^*_E) \right) \\ x_{i,-E}\left( y_i - \pi(x_{i,E}^T \beta^*_E) \right) - D x_{i,E}\left( y_i - \pi(x_{i,E}^T \beta^*_E) \right) \end{pmatrix}. \qquad (51)$$


Now we rewrite the selection event in terms of $(T, \omega)$. Using the KKT conditions of
(38),

x'^iy - t^e0e)) = \/n(w + kz) + 

SE^E > 0, ||u_£;||oo < 1, 

SB = sign(/3£)) and u-e is the subgradient for the inactive 

U-eJ 

variables. Using a Taylor expansion on the /3e as well, we see that 

0 


where z = 


^X [y TrE0E)] n ix’^^ly - TTEiPE)] 


) (c) 


Plugging in the equalities in the KKT conditions, we will have. 


pE = Pe - -^Q ^{uje + kEZE) + Op{n 

Vn 

I^XI^eIv - ■^e{Pe)] = I^XTeIv - t^e{Pe)] + C{Pe - Pe) + Op(n"^/^) 

= -X'^E[y - tte{Pe)] + -f=D[u}E + SeZe) + Op(n“^/^) 

Using the inequalities in the KKT conditions, we see that the selection event is $\{A_M T + B_M \omega \le b_M\}$
with $A_M$, $B_M$ and $b_M$ defined in the lemma.

□ 


Appendix C: Proofs related to Logistic noise 


Throughout the article, logistic noise has played an important role in all the examples.

The following lemma on the tail behavior of the logistic distribution is crucial to all
the proofs with added logistic noise. Let $G$ be the CDF of Logistic($\kappa$), with $\kappa$ being
the scale parameter, and let $g$ be the PDF of $G$.


^r\i ixj UC (Z> 




1 + ' (l + 

Lemma 16. The following lower bounds hold, 

G{kw) > g{w) > 


, 2 ■ 


For fc = 0,1, 2,3,... .• 


Qk 


dw^ 


9{w) 




(52) 


(53) 


where the $C_k$'s are universal constants.

Proof. We can write


9{w) 


w<0 


where $h_0(x) = (1 + x)^{-2}$. For $j \ge 1$, define $h_j(x) = x \cdot h'_{j-1}(x)$. By induction,
we claim that for each $j$, $h_j$ is rational such that the polynomial in the numerator is of
order 2 less than the denominator, and the denominator polynomial is bounded below
by 1. Hence, the $h_j$'s are bounded on the interval $[0, 1]$. Now, it is not hard to see that

du;k I ^k+i (e«“')e'=“ w < 0 

for universal $C_{j,k}$'s and $k = 0, 1, 2, \ldots$. □

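Now we state the following lemmas, which are the foundations of the proofs of various
lemmas in the article.

Lemma 17. Assume $T_n$ is a decomposable statistic and $\xi_{i,n}$ has mean $\mu_n$, variance $\sigma^2$,
and centered exponential moments in a neighbourhood of zero, i.e. satisfies condition
(29). Denote $Z_n = \sqrt{n}(T_n - \mu_n)$, then

$$E[\exp(\kappa Z_n)] \to \exp\left( \frac{\kappa^2 \sigma^2}{2} \right), \quad \text{for } \kappa \ge 0.$$

Lemma 18. In Example 2, if we normalize the sample mean $Z_n = \sqrt{n}(\bar X_n - \mu_n)$, we
can rewrite the selective likelihood ratio and the pivot as

$$\ell_{\mathbb{F}_n}(Z_n) = \frac{G(2 - Z_n - \sqrt{n}\mu_n)}{E_{\mathbb{F}_n}[G(2 - Z_n - \sqrt{n}\mu_n)]},$$

and

$$P(Z_n) = \frac{\int_{Z_n}^{\infty} G(2 - t - \sqrt{n}\mu_n) \exp(-t^2/2)\, dt}{\int_{-\infty}^{\infty} G(2 - t - \sqrt{n}\mu_n) \exp(-t^2/2)\, dt}.$$

Then for any $\mathbb{F}_n$ with finite centered exponential moment in a neighbourhood of
zero, we have for $k = 0, 1, 2, 3$

$$\left| \frac{\partial^k}{\partial Z^k} \ell_{\mathbb{F}_n}(Z) \right| \le C_1 \exp[\kappa |Z|], \qquad \left| \frac{\partial^k}{\partial Z^k} P(Z) \right| \le C_1, \qquad (54)$$

for some $C_1$ only depending on $\kappa$.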

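Proof of Lemma 17.

Proof. Without loss of generality, we assume $T_n = \frac{1}{n}\sum_{i=1}^n \xi_{i,n}$. Since $\mathbb{F}_n$ has centered
exponential moments in a neighbourhood of zero, it is easy to see that

$$E[\exp(\kappa Z_n)] = \left[ M\left( \frac{\kappa}{\sqrt{n}} \right) \right]^n$$

exists as long as $\frac{\kappa}{\sqrt{n}} < a$. $M(\cdot)$ is the moment generating function of $\xi_{i,n} - \mu_n$,
$M(t) = E[\exp(t(\xi_{i,n} - \mu_n))]$. Therefore,

$$\lim_{n \to \infty} n \log\left[ M\left( \frac{\kappa}{\sqrt{n}} \right) \right] = \lim_{t \to 0} \frac{\log[M(\kappa t)]}{t^2} = \lim_{t \to 0} \frac{\kappa^2 M''(\kappa t)}{2 M(\kappa t)} = \frac{\kappa^2 \sigma^2}{2}.$$

To derive the equality, we used $M''(0) = \mathrm{Var}[\xi_{i,n}] = \sigma^2$ and $M'(0) = 0$, $M(0) = 1$. □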
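
The limit in Lemma 17 can be checked numerically in a case where the moment generating function is explicit. The sketch below takes the $\xi_{i,n}$ to be Rademacher signs (an illustrative choice, with mean 0 and variance 1), for which $E[\exp(\kappa Z_n)] = \cosh(\kappa/\sqrt{n})^n$ exactly; the function names are hypothetical.

```python
import math

def mgf_of_normalized_sum(kappa, n):
    """E[exp(kappa * Z_n)] for Z_n = n^{-1/2} * (sum of n Rademacher signs)."""
    return math.cosh(kappa / math.sqrt(n)) ** n

def gaussian_limit(kappa, sigma2=1.0):
    """The limiting value exp(kappa^2 * sigma^2 / 2) from Lemma 17."""
    return math.exp(kappa ** 2 * sigma2 / 2.0)
```

For this example the finite-$n$ value approaches the Gaussian limit from below, since $\cosh(x) \le \exp(x^2/2)$.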


Proof of Lemma 18. 

Proof. Noticing the lower bound in (52), we have 


E [G{2 - Zn- Vnpn)] > 




> -E 
- 2 




On the other hand, using the upper bounds in (53), we have for $k = 1, 2, 3$,

-4„(^) <2 r . -:-=-x-r < 2- 


dZ’^ 


E 




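Since $x^{-1}$ is convex on the positive axis, it is not hard to see

$$\frac{1}{E\left[ e^{-\kappa|2 - Z|} \right]} \le E\left[ e^{\kappa|2 - Z|} \right] \le e^{2\kappa} E\left[ e^{\kappa Z} + e^{-\kappa Z} \right].$$

Thus using Lemma 17, we know $E[e^{\kappa Z}] \to \exp(\kappa^2/2)$. Thus, we conclude

$$\sup_n \left| \frac{\partial^k}{\partial Z^k} \ell_{\mathbb{F}_n}(Z) \right| \le C_1 \exp[\kappa|Z|], \quad k = 1, 2, 3. \qquad (55)$$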


To verify the above inequality for $k = 0$, notice that for $\mu_n > 0$, $G(2 - Z - \sqrt{n}\mu_n) \le G(2 - Z)$.
Thus the denominator of $\ell_{\mathbb{F}_n}(Z)$ is bounded below using the argument
above. For $\mu_n < 0$,

$$G(2 - Z - \sqrt{n}\mu_n) \le \exp(-\kappa(2 - Z - \sqrt{n}\mu_n)) \le \exp(-\kappa\sqrt{n}|\mu_n| + \kappa|2 - Z|).$$

The term $\exp(-\kappa\sqrt{n}|\mu_n|)$ cancels with the one in the denominator, thus (55) holds for
$k = 0$ as well.

Analogously, similar bounds can be derived for the derivatives of $P(Z)$ as well, thus
we have the conclusion of the lemma. □


C.l. Proof of Lemma 4 and Lemma 10 


The proof of Lemma 4 is a simple application of Lemma 17 and Lemma 18.

Proof. By the law of large numbers, we know that $\bar X_n$ is consistent for $\mu$ unselectively.
Thus, using the result of Lemma 3, we only need to verify that the selective likelihood
is integrable in $L^q$. For simplicity, we take $q = 2$.

First notice from (54) that the selective likelihood ratio is bounded by a multiple of
$\exp[\kappa|Z|]$. Then by Lemma 17,

$$\limsup_n E_{\mathbb{F}_n}\left[ \ell_{\mathbb{F}_n}(\bar X_n)^2 \right] \le 2 C_1^2 \exp(2\kappa^2).$$

□ 


The proof of Lemma 10 uses the results in Lemma 18 and Lemma 15.

Proof. It follows simply from (54) that condition (28) is satisfied with the norm function
$\Omega$ simply being the absolute value function. Therefore, we only need to verify (30).
Note that for $\mu_n = \mu < 0$

xQ + W > 2] — P$xQ [\/tr^n "h W > 2] 

P<I>xQ [^/nXn UJ > 2] 

[G{2 -Z- yGifi)] - E$ [G{2 -Z- v^^)] 

“ [G(2 - Z - Vnp)] 

exp{—K^/nfJ.)Ep^ [G{2 — Z — ^/nM)] ~ [G{2 — Z — v^f)] 

“ E$ [exp(— k|2 — Z\)] 

<2E$ [exp(2K -I- 2|Z|)] • C exp(— 

<2C'exp(K^)n“^/^, 

The second to last inequality uses Lemma 15 and the fact that $\lambda_3(G) \le \exp(\kappa\sqrt{n}\mu)$.
For $\mu_n = \mu > 0$, the denominator $P_{\Phi \times \mathbb{Q}}[\sqrt{n}\bar X_n + \omega > 2]$ is bounded below, and $G$
has bounded derivatives. Therefore, a simple application of the Berry-Esseen Theorem
will suffice. □

Appendix D: Proofs related to affine selection regions
D.l. Proof of Lemma 12 

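The quantity that appears in both the pivot and the selective likelihood ratio is

$$Q(z; \Delta) = P(A(z + \Delta) + \omega \in K) = \int_{K - A(z + \Delta)} G(dw),$$

where $\omega \sim G$. The associated selective likelihood in terms of $z$ is

$$\ell_{\mathbb{F}}(z; \Delta) = \frac{Q(z; \Delta)}{\int_{\mathbb{R}^p} Q(t; \Delta)\, \mathbb{F}(dt)}.$$

We first rewrite the pivot in terms of $z$:

$$P(z; \Delta) = \frac{\int_{\eta^T z}^{\infty} Q(t; Lz, \Delta) \exp(-t^2/2\sigma^2)\, dt}{\int_{-\infty}^{\infty} Q(t; Lz, \Delta) \exp(-t^2/2\sigma^2)\, dt}, \qquad (56)$$

where we use a slight abuse of notation for the one-dimensional function $Q(t; Lz, \Delta)$, with
$L = I - \frac{1}{\sigma^2}\Sigma\eta\eta^T$,

$$Q(t; Lz, \Delta) = P\left( t \cdot \frac{1}{\sigma^2} A\Sigma\eta + ALz + A\Delta + \omega \in K \right) = \int_{K - t \cdot A\Sigma\eta/\sigma^2 - ALz - A\Delta} G(dw).$$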

We first establish a lower bound on $E_{\mathbb{F}_n}[Q(Z; \Delta)] = P_{\mathbb{F}_n \times \mathbb{Q}}[A(Z + \Delta) + \omega \in K]$
under the local alternatives.

Lemma 19. If we assume the lower bound condition, then under the local alternatives
with radius $B$, i.e. $d_h(0, K - A\Delta) \le B$, we have

$$\int_{\mathbb{R}^p} Q(u; \Delta)\, \mathbb{F}(du) \ge C^- \cdot C(\Phi, h) \cdot e^{-B},$$

where $C(\Phi, h)$ is a constant only depending on the normal distribution $\Phi = N(0, \Sigma)$
and the norm $h$ in the local alternatives condition.


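Proof. We first see that the lower bound condition gives the following lower bound.

$$Q(u; \Delta) = \int_{K - A(u + \Delta)} G(dw) \ge C^- \exp\left( -\inf_{w \in K - A(u + \Delta)} h(w) \right). \qquad (57)$$

Consider

$$\int_{\mathbb{R}^p} Q(u; \Delta)\, \mathbb{F}(du) \ge C^- \int_{\mathbb{R}^p} \exp\left( -\inf_{w \in K - A(\Delta + u)} h(w) \right) \mathbb{F}(du)$$

$$= C^- \int_{\mathbb{R}^p} \exp\left( -\inf_{w \in K - A\Delta} h(w - Au) \right) \mathbb{F}(du)$$

$$\ge C^- \int_{\mathbb{R}^p} \exp\left( -\inf_{w \in K - A\Delta} h(w) - h(-Au) \right) \mathbb{F}(du)$$

$$= C^- \cdot \exp\left( -\inf_{w \in K - A\Delta} h(w) \right) \int_{\mathbb{R}^p} e^{-h(Au)}\, \mathbb{F}(du)$$

$$\ge C^- \cdot e^{-B} \int_{\mathbb{R}^p} e^{-h(Au)}\, \mathbb{F}(du).$$
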


Finally, since $\exp(-h(Au))$ has uniformly bounded derivatives up to the third
order, we have

$$\int_{\mathbb{R}^p} \exp(-h(Au))\, \mathbb{F}(du) \to \int_{\mathbb{R}^p} \exp(-h(Au))\, \Phi(du)$$

as $Z \to N(0, \Sigma)$ in distribution. Let $C(\Phi, h) = \int_{\mathbb{R}^p} \exp(-h(Au))\, \Phi(du)$, and we will
have the conclusion of the lemma. □

The following lemmas establish the bounds on the derivatives for the likelihood
function and the pivot $P(u; \Delta)$. Lemma 12 is easily obtained using Lemma 20 and
Lemma 21 below.

Lemma 20. Suppose the smoothness and the lower bound conditions are satisfied, 
then for local alternatives with radius B, 


$\Phi = N(0, \Sigma)$.


(58) 


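Proof. The smoothness condition implies the following upper bound. For a multi-index
$\alpha$, we have

$$\frac{\partial^\alpha}{\partial z^\alpha} Q(z; \Delta) = \frac{\partial^\alpha}{\partial z^\alpha} \int_{K - A(\Delta + z)} g(w)\, dw = \int_{K - A\Delta} \frac{\partial^\alpha}{\partial z^\alpha} g(w - Az)\, dw.$$
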


Therefore, from the smoothness condition. 




< C(A)C„ Ct{A). 


(59) 

□ 


This combined with Lemma 19 gives the conclusion of the lemma. 

Next, we derive the exponential bounds on the derivatives of the pivot $P(z; \Delta)$ with
respect to $z$.

Lemma 21. Assuming the conditions of Lemma 12, for a multi-index a up to the order 
d , < n* . p‘^-t-ip{h)\\ALz\\2 


dz‘ 


:Piz;A) 


where the norm on the left is the element-wise maximum, $C$ is independent of $(z, \Delta)$,
and $\mathrm{Lip}(h)$ is the Lipschitz constant of $h$ with respect to the $\ell_2$ norm.

Proof. To get a lower bound on the denominator, note (57) 


; Lz, A) > C exp 
> C~ exp 


inf h{w — t ■ -—ASA]) 


w£K—ALz — AA 


w^K-ALz-AA ' 


Therefore, the denominator will be lower bounded by 


1 


; Lz, A) exp(-f^/2cr^)df 


J-i 

>C~ exp [h{AT,r]/a'^)'^a^/2] ■ exp 
>C- exp [L(ASr,/a2)V2/2] 


inf h{w) 

w^K—ALz—AA 
On the other hand, the upper bound (59) ensures,






exp(-i^/2cr^)di < C^iA). 


Note the derivatives of the pivot will be a polynomial in terms of the form. 


daQ{t; Lz, A) expi-t"^/2a‘^)dt 
JZo exp{-t‘^/ 2 a‘^)dt 

and therefore, it is easy to get the conclusion of the lemma. 


□ 


D.2. Proof of Lemma 13 

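Using Lemma 19 and the following lemma, we can easily prove Lemma 13.

Lemma 22. Let $Z_n = \sqrt{n}(T_n - \mu_n) \in \mathbb{R}^p$ and suppose $\mathbb{F}_n$ has finite third moments $\gamma$.
Moreover, suppose the randomization noise $\omega \sim \mathbb{Q}$, a probability measure on $\mathbb{R}^d$. Then
for any sequence of sets $(U_n)_{n \ge 1}$, $U_n \subset \mathbb{R}^p \times \mathbb{R}^d$, we have

$$\left| P_{\mathbb{F}_n \times \mathbb{Q}}[(Z_n, \omega) \in U_n] - P_{\Phi_n \times \mathbb{Q}}[(Z_n, \omega) \in U_n] \right| \le C_3 \gamma n^{-1/2},$$

where $\Phi_n = N(\mu(\mathbb{F}_n), \Sigma(\mathbb{F}_n))$ and $C_3$ is a constant depending only on $p$.

The proof of Lemma 22 uses the well-known Berry-Esseen Theorem. A multivariate
extension can be found in Götze (1991).

Proof. For each $\omega$, we denote

$$U_n(\omega) = \{ Z \in \mathbb{R}^p : (Z, \omega) \in U_n \} \subset \mathbb{R}^p.$$

Thus the difference in the two probabilities is

$$\left| P_{\mathbb{F}_n \times \mathbb{Q}}[(Z, \omega) \in U_n] - P_{\Phi_n \times \mathbb{Q}}[(Z, \omega) \in U_n] \right| \le E_{\mathbb{Q}}\left[ \left| P_{\mathbb{F}_n}[Z \in U_n(\omega)] - P_{\Phi_n}[Z \in U_n(\omega)] \right| \right]$$

$$\le \sup_{U \subset \mathbb{R}^p} |\mathbb{F}_n(U) - \Phi_n(U)| \le C_3 \gamma n^{-1/2},$$

where $C_3$ only depends on the dimension $p$. The last inequality is a direct application
of equation (1.5) in Götze (1991). □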








